NOTICE 


THIS DOCUMENT HAS BEEN REPRODUCED FROM 
MICROFICHE. ALTHOUGH IT IS RECOGNIZED THAT 
CERTAIN PORTIONS ARE ILLEGIBLE, IT IS BEING RELEASED 
IN THE INTEREST OF MAKING AVAILABLE AS MUCH 
INFORMATION AS POSSIBLE 





"Made available under NASA sponsorship
in the interest of early and wide dissemination
of Earth Resources Survey
Program information and without liability
for any use made thereof."


Use of Collateral Information to 
Improve Landsat Classification Accuracies 


NASA Grant NSG-2377 


Semiannual Progress Report 

N80-29815

Unclas
00268

Alan H. Strahler 
and 

John E. Estes 
Co-Principal Investigators 
Geography Remote Sensing Unit 


(E80-10268) USE OF COLLATERAL INFORMATION
TO IMPROVE LANDSAT CLASSIFICATION ACCURACIES
Semiannual Progress Report, Oct. 1979 - Mar.
1980 (California Univ.) 75 p HC A04/MF A01




University of California 
Santa Barbara, CA 
93106 


Technical Monitor: Dr. David L. Peterson, NASA-Ames 




UNIVERSITY OF CALIFORNIA, SANTA BARBARA 


BERKELEY • DAVIS • IRVINE • LOS ANGELES • RIVERSIDE • SAN DIEGO • SAN FRANCISCO • SANTA BARBARA • SANTA CRUZ


SANTA BARBARA, CALIFORNIA 93106


June 17, 1980 


Dr. David L. Peterson 
NASA-Ames Research Center 
Mail Stop 242-4 
Moffett Field, CA 94035 

Dear Dave: 

The purpose of this letter is to provide a semi-annual report of
our activities under NASA Grant NSG-2377. The period covered by the
report is October 1, 1979 to March 30, 1980, the six months
immediately preceding our second-year renewal on April 1. The
semi-annual report is submitted in fulfillment of our responsibilities as
described in "NASA Provisions for Research Grants." My letter of
December 10, 1979, serving formally as a report of activities during
the July 1 to September 30, 1979 interval, also discusses some of
the activities which occurred in the October to December time
period. Although I will refer to these activities in this report,
I will emphasize activities occurring after the first of the year.


Publications. I am very pleased to announce that during this
period three manuscripts have been placed in the publication process.
The first of these, entitled "The Use of Prior Probabilities in Maximum
Likelihood Classification of Remotely Sensed Data," was enclosed with
my letter of December 10. This manuscript has, within the past few
weeks, been accepted for publication in Remote Sensing of Environment.
Since the manuscript was accepted without revision, it should appear
essentially the same as the copy which you possess.

The remaining two manuscripts are in symposium proceedings. The
first of these is "Incorporating Collateral Data in Landsat Classification
and Modeling Procedures," which will appear in the Proceedings
of the Fourteenth International Symposium on Remote Sensing of the
Environment. The second is "A Logit Classifier for Multi-Image Data,"
which will appear in the Proceedings, IEEE Workshop on Picture Data
Description. I enclose copies as specified in the grant provisions.




Travel. In March, 1980, my research associate, Mr. Curtis
Woodcock, and I attended an informal meeting of California researchers
involved in forestry applications of remote sensing at California
Polytechnic Institute and State University, San Luis Obispo. At
your invitation, we briefed the group on some of our current forestry
work as well as the implications of our research concerning the
incorporation of collateral data into Landsat processing systems. We
enjoyed this constructive opportunity to share research ideas with
investigators working on similar applications.

Grant travel funds were also used to support a side trip from
Denver, Colorado, to Fort Collins in December, 1979, to allow me to
confer with Dr. James A. Smith, of the School of Forestry, concerning
spectral modeling and possible joint research in exploring the role
of collateral information in the remote sensing process.

Although not during the time period covered by this report, I 
also attended the Fourteenth International Symposium on Remote Sensing 
of the Environment, held in San Jose, Costa Rica, in April, 1980. At 
this meeting, I had many productive exchanges with remote sensing 
researchers, and was allowed the opportunity to present the results 
of our first year's work in this grant. I should also mention that, 
as a result of discussions at that meeting, I have been invited to 
present a seminar concerning the use of collateral information in 
remote sensing at the Canada Centre for Remote Sensing (CCRS) in late 
June. Travel and per diem expenses for this trip will be reimbursed 
by CCRS. 

Technical Progress. During this reporting period, we made substantial
progress in three research topics. The first of these is the
survey of quantitative techniques available for merging categorical 
and continuous variables, and the assessment of the utility of these 
techniques for remote sensing applications. This survey was completed, 
and the results are summarized in the first part of the San Jose pre- 
print (attached). The second area of progress concerns the further 
application of prior probabilities in the context of a time-dependent 
land use classification system. Again, the results are summarized in 
the San Jose preprint. 




Perhaps some of our most exciting work concerned the development 
and application of the logit model to remote sensing. This model does 
not rely on multivariate normal assumptions, and allows probability 
of class membership to be predicted as a function of any array of 
independent variables, either categorical or continuous in nature. 
Further, the logit model may be formulated in a linear or curvilinear 
fashion, depending on the application. Our research in this area is 
summarized in the attached preprint, "A Logit Classifier for Multi- 
Image Data." 

Software. During the reporting period, we have also developed two
new software items. The first of these is a program which fits a logit 
model to a set of calibration data which are input to the program. 
Although existing programs are available for logit modeling, none of 
these were suitable for the data which were available. The second 
software item was a new VICAR program, PROBMAPS, which accepts images 
of independent variables, input in MSS format, as well as the calibra- 
tion parameters for the logit model, and produces a set of output images 
recording the probability of membership of each pixel for each class as 
predicted by the logit model. Both of these programs are quite specific 
to the logit model application, but both have been coded in universal 
FORTRAN so that they can be compiled at other institutions. We will 
be happy to make this software available to you at NASA-Ames or to any 
other users as well. 

Future Plans. Our plans for the coming research period have not
been completely formalized at this time. I hope to be able to pursue 
most or all of the following research topics: 

1. modeling universal soil loss for a pixel based on landform,
land use, rainfall, and topographic parameters;

2. improved rainfall runoff modeling on a pixel-by-pixel basis
for use in hydrologic modeling;

3. further exploration of the logit model, including its ability
to simulate various nonnormal distributions and perform as
a classifier of remotely sensed data; and

4. continued exploration of time-sequential classification
of land use in Ventura County.

This concludes my semi-annual report. 

Sincerely yours,


Alan H. Strahler 
Principal Investigator 




A LOGIT CLASSIFIER FOR MULTI-IMAGE DATA 


Alan H. Strahler and Paul F. Maynard 


Department of Geography 
University of California 
Santa Barbara, California 93106 


Preprint, Proc. IEEE Workshop
on Picture Data Description,
Asilomar, CA.


A new classifier for multi-image databases
uses maximum likelihood estimation of parameters
fitting a logit model to training data. A logit
is the natural logarithm of a probability ratio,
e.g., $\ln(P_a/P_b)$. As an example, a linear logit
classification model for a simple two-class case
based on four Landsat channels is:

$$\ln\left(\frac{P_a}{P_b}\right) = \beta_0 + \beta_1(\mathrm{LS4}) + \beta_2(\mathrm{LS5}) + \beta_3(\mathrm{LS6}) + \beta_4(\mathrm{LS7})$$

where $P_a$ and $P_b$ are the probabilities that the pixel
belongs to class $a$ and class $b$ respectively, $\beta_0, \ldots, \beta_4$
are calibration constants, and LS4...LS7 are
the four Landsat channels for bands 4 through 7.

Compared with usual Bayesian maximum likelihood
classification, the logit classifier has
certain distinct advantages. It is nonparametric,
in that multivariate normality is not assumed.
The model may be specified in linear or
curvilinear forms as appropriate. Further, the
model can incorporate categorical information in
the form of dummy variables, and can therefore
be used to merge continuously measured image
data with categorical collateral data in a
single classification step.

Introduction 

The Bayes maximum likelihood classifier (MLC)
is the most commonly used decision rule for discrete
classification with Landsat-derived data.
Such a classifier uses the Bayes decision rule to
assign pixels to the class with highest probability,
given the observed vector of spectral measurements
and the prior distribution of classes. To find
this probability, the Bayes MLC requires an estimate
of the conditional probability of occurrence
of the observed vector of MSS data given that it is
associated with a specified class. This estimate
has traditionally been obtained by assuming that
the observed measurement vectors were Gaussian, or
normal; therefore, the best estimate is a measure
of the probability density value of the multivariate
normal distribution for the class evaluated at
the observation.

With four channels of Landsat spectral data,
the Bayes maximum likelihood classifier has achieved
good classification accuracies in many cases.
Classification accuracies have been further boosted
by the inclusion of collateral data as additional
logical channels or as indexes to sets of prior
probabilities in the case of categorical collateral
variables.¹ Further increases in classification
accuracy will undoubtedly result from more optimal
spectral channels and improved techniques of
incorporating collateral data.

However, it is very likely that the spectral
reflectances from some classes in many applications
are not normally distributed. In this situation,
even with the ideal spectral "windows" and precise
and relevant collateral data, the Bayes MLC will
reach an asymptotic accuracy limit that will be
less than optimal. In fact, for classes that deviate
from normality, classification accuracies
could be significantly less than optimal.

Furthermore, each of the two previously mentioned
techniques of increasing accuracy by input
of collateral data to the Bayes classifier has
critical limitations. The Bayes MLC can only
accept measurement variables which are continuously
distributed, and this requires continuous measurement
(or at least discrete measurement with a
large number of discrete steps). Consequently,
the collateral channel in the direct input approach
can utilize only data which are measured on the
interval or ratio scale. This requirement eliminates
many potentially useful databases, such as
soils maps, geologic maps, political boundaries,
census tracts, etc. And when collateral information
is incorporated through the mechanism of
prior probabilities, the calculation of the prior
probabilities usually requires a sophisticated
sampling design.

One solution to these problems is to use a
statistical technique that predicts probabilities
of categorical membership with no distribution
assumptions. One such technique, called the logit
regression model, has been widely used in the
social sciences for the last twenty years. The
logit regression model generates predicted probabilities
that all sum to one for a specified suite
of classes, and the classification can be awarded
to the category with the highest predicted probability.
Further, predictor variables may be continuous
or categorical; the model may be specified
in linear or curvilinear forms; and the assumption
of multivariate normality is not required.

This paper has the following components: 

1. a review of the mathematics of the Bayes MLC;

2. an examination of the sources of classification
error due to non-normal distributions;

3. an examination of the problems encountered with
utilizing collateral data;

4. development and exposition of the logit regression
model, with an emphasis on its ability to
act as a nonparametric classifier;

5. a description of planned research which will
compare the classification accuracies of the
logit regression model and the Bayes MLC for a
land use/land cover classification example; and

6. presentation of an example of the use of a linear
logit model in a remote sensing application.


The Bayes Maximum Likelihood Classifier
Background

In the past ten years, maximum likelihood
classification has found wide application in the
field of remote sensing. Based on multivariate
normal distribution theory, the maximum likelihood
classification algorithm has been in use for applications
in the social sciences since the late
1940's. Providing a probabilistic method for recognizing
similarities between individual measurements
and predefined standards, the algorithm found
increasing use in the field of pattern recognition
in the following decades. In remote sensing,
development of multi-spectral digital images of
land areas from aircraft or spacecraft provided the
opportunity to use the maximum likelihood criterion
in producing thematic classification maps of large
areas for such purposes as land use/land cover
determination and natural and cultivated land inventory.

Derivation of the Bayes MLC

In order to understand the difference between
a classification awarded on the basis of logit-generated
predicted probabilities and posterior probabilities
derived from the Bayes MLC, it will be
helpful to briefly review the mathematics of the
Bayes maximum likelihood decision rule. In the
multivariate remote sensing application, it is
assumed that each observation $X$ (pixel) consists of
a set of measurements on $p$ variables (channels).
Through the selection of training sites, a set of
observations which correspond to a class is identified;
that is, a set of similar objects characterized
by a vector of means on measurement variables
and a variance-covariance matrix describing
the interrelationships among the measurement variables
which are characteristic of the class. Although
the parametric mean vector and dispersion
matrix for the class remain unknown, they are estimated
by the sample means and dispersion matrix
associated with the object sample.


Multivariate normal statistical theory describes
the conditional probability that an observation
$X$ will occur, given that it belongs to a
class $\omega_i$, as the following function:

$$P\{X \mid \omega_i\} = (2\pi)^{-p/2}\,|D_i|^{-1/2} \exp\left[-\tfrac{1}{2}(X - m_i)'\,D_i^{-1}\,(X - m_i)\right] \tag{1}$$

(Please refer to Table 1 for a description of the
mathematical symbols). As applied in a maximum
likelihood decision rule, expression (1) allows the
calculation of the conditional probability that an
observation is a member of each of $k$ classes. However,
the probability actually desired is the posterior
probability $P\{\omega_i \mid X\}$. It can be shown⁴ that:

$$P\{\omega_i \mid X\} = \frac{P\{X \mid \omega_i\}\,P\{\omega_i\}}{\sum_{j=1}^{k} P\{X \mid \omega_j\}\,P\{\omega_j\}} \tag{2}$$

This expression leads to the decision rule:

Choose $i$ which minimizes

$$\ln|D_i| + (X - m_i)'\,D_i^{-1}\,(X - m_i) - 2\ln P\{\omega_i\} \tag{3}$$

In usual practice the prior probabilities $P\{\omega_i\}$ are
assumed equal, or $P\{\omega_i\} = 1/k$, where $k$ is the number of
classes. In this case, the last term in expression
(3) is constant over all $k$ classes, and need not be
considered in the decision rule. This equal priors
decision rule is used in the currently distributed
versions of LARSYS and VICAR, two image processing
systems authored respectively by the Laboratory for
Applications of Remote Sensing at Purdue University
and the Jet Propulsion Laboratory of California
Institute of Technology at Pasadena.
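
The decision rule of expression (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the LARSYS or VICAR implementation: the function names `mlc_discriminant` and `mlc_classify`, the two-channel class statistics, and the prior values are all hypothetical and chosen for the example.

```python
import numpy as np

def mlc_discriminant(x, mean, cov, prior=None):
    """Value of expression (3) for one class; the smallest value wins."""
    diff = x - mean
    # Quadratic term (X - m_i)' D_i^{-1} (X - m_i), solved without forming the inverse
    maha = diff @ np.linalg.solve(cov, diff)
    # ln|D_i|; slogdet returns (sign, log of absolute determinant)
    _sign, logdet = np.linalg.slogdet(cov)
    value = logdet + maha
    if prior is not None:
        value -= 2.0 * np.log(prior)   # the -2 ln P{w_i} term
    return value

def mlc_classify(x, means, covs, priors=None):
    """Index of the class that minimizes expression (3)."""
    if priors is None:
        priors = [None] * len(means)   # equal-priors rule: the last term is dropped
    scores = [mlc_discriminant(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmin(scores))

# Two hypothetical classes measured in two channels (made-up statistics).
means = [np.array([30.0, 40.0]), np.array([36.0, 46.0])]
covs = [np.eye(2) * 9.0, np.eye(2) * 9.0]
pixel = np.array([33.5, 43.5])         # spectrally closer to the second class

equal_priors_label = mlc_classify(pixel, means, covs)
weighted_label = mlc_classify(pixel, means, covs, priors=[0.95, 0.05])
```

With equal priors the pixel goes to the spectrally nearer class (index 1); with the assumed unequal priors the same pixel is assigned to the heavily weighted class (index 0).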

Classification Accuracy and the Assumption of Normality


In a typical supervised classification, training
sites are selected by the analyst to typify
each class. Histograms of spectral values for
classes are inspected for multivariate normality,
and when a class actually consists of several distinctive
signatures, training sites are reaggregated
into subclasses, each of which is approximately
multivariate normal. In this way, a set of
multivariate normal dispersion patterns is defined
for the desired classes. It is important to realize
that such dispersions, because they are selected
to be as "pure" as possible, are probably
underdispersed with respect to the true information
class. This effect produces a difficulty in the
classification of mixed pixels. Since the MLC
model does not provide for mixed pixels, the implicit
assumption is that mixed pixels are to be classified
according to the most probable signature
match; the components of the signature which are
reflected from the less important classes contained
within the pixel are thus regarded as random noise.
The mixed pixel, then, is typically classified by
comparing probability densities within the tails of
overlapping multivariate normal distributions. The
accurate classification of mixed pixels under MLC
thus requires a good fit of the tails to multivariate
normality; however, it is obvious that the
selection of training sites for purity will of
necessity produce a poor fit in the tails. And
mixed pixels will constitute a large portion of the
scene, up to forty percent in some agricultural
applications (F. Hall, personal communication).

This reasoning naturally leads to the consideration
of nonparametric classifiers. Brooner et
al.⁷ compared the Bayes MLC to a classifier which
used $P\{X \mid \omega_i\}$ as directly estimated by a sampling
procedure, and reported a four percent increase in
classification accuracy over MLC. However, direct
estimates of $P\{X \mid \omega_i\}$ require more data as well as
dealing with a $p$-dimensional table storage problem.
The logit model, discussed in the following pages,
provides an alternative which should require fewer
data to calibrate and can, by virtue of its curvilinear
modeling, approximate the real distribution
without assuming multivariate normality.

Problems with Incorporating Prior Probabilities 

As shown earlier, the Bayes MLC can easily be
modified to take into account prior probabilities
which describe how likely a class is to occur in
the population of observations as a whole. The
prior probability itself is simply an estimate of
the proportion of pixels which will fall into a
particular class. These prior probabilities are
sometimes termed weights, since the modified classification
rule will tend to weight more heavily
those classes with higher prior probabilities.
Strahler¹ showed via simplified numerical examples
how these different weights can affect the decision
of the Bayes MLC. As the prior probability $P\{\omega_i\}$
in expression (3) becomes large and approaches 1,
its logarithm will go to zero and the classification
decision will effectively be made with expression (1).

However, since this probability and all
others must sum to one, the prior probabilities of
the remaining classes will be small numbers, thus
increasing the value of the expression. Since the
classification is awarded to the class with the
smallest value, the effect will be to force classification
into the class with high probability.
Therefore, the more extreme are the values of the
prior probabilities, the less important are the
actual observation values $X$.
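
As a worked illustration of this effect, consider the prior term $-2\ln P\{\omega_i\}$ of expression (3) for a favored and an unfavored class. The prior values 0.95 and 0.05 are assumed purely for the arithmetic and do not come from the study.

```latex
\begin{align*}
-2\ln P\{\omega_1\} &= -2\ln(0.95) \approx 0.103 \\
-2\ln P\{\omega_2\} &= -2\ln(0.05) \approx 5.991 \\
\text{difference} &\approx 5.889
\end{align*}
```

Under these assumed priors, the spectral terms of expression (3) for the second class would have to be smaller than those of the first class by more than about 5.9 before a pixel could be assigned to the second class.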

Strahler⁵ has demonstrated how prior probabilities
can be used as a mechanism to incorporate
collateral data in categorical form into the Bayes
MLC. His mechanism uses a set of prior probabilities
estimated for each collateral category by an
external sampling procedure, with the classification
algorithm accessing the appropriate set of
probabilities contingent upon the collateral category
of the pixel. In this fashion, categorical
collateral information is merged with multivariate
normal information concerning the spectral signatures.
Although this approach was proven effective
for a forestry application, estimation of the sets
of prior probabilities may require considerable
data collection, depending on the number of classes
and collateral categories. In addition, multivariate
normality of signatures is assumed, and the
comments of preceding paragraphs apply. In contrast,
use of the logit classification model allows
categorical and continuous information to be mixed
freely through the mechanism of dummy variables.
And, again, the model can be fitted in a linear or
curvilinear fashion as desired, avoiding the
assumption of multivariate normality. Thus, the
logit model offers a more natural, straightforward
way of incorporating categorical variables into
the classification procedure.


The Logit Regression Model 
Linear Modeling of Probabilities 

The most commonly used predictive multivariate
statistical technique is probably ordinary least
squares (OLS) regression. The prediction, or estimated
value of the dependent variable, is a function
of the vector of estimated betas ($\beta$) in combination
with the vector of observed independent
variables ($X$). The betas are estimated in such a
way that the variance about the least squares regression
line is minimized.

When used to model probabilities, OLS regression
has one major drawback: although probabilities
are constrained to lie within the range of 0
to 1, the predictions generated from such a model
are unbounded and may take values from minus infinity
to plus infinity. Thus, the predictions may
lie outside the meaningful range of probability.
Further, the probability of each class must be
modeled separately, and there is no constraint to
ensure that all probabilities must sum to one.

One solution to the bounding problem is to
specify that

$$0 \le P_i \le 1$$

(where $P_i$ is the probability of observing a specified
class or category of the dependent variable).
In the case of the ordinary least squares regression
model,

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i},$$

the simplest way to satisfy this condition is to
impose the following arbitrary definition of $P_i$:

1. $P_i$ is equal to 0 if $Y_i$ is less than 0;

2. $P_i$ is equal to $Y_i$ if $Y_i$ is equal to or between
0 and 1;

3. $P_i$ is equal to 1 if $Y_i$ is greater than 1;

and use straightforward ordinary least squares
estimation of the regression parameters. This
solution is often referred to as the linear probability
model. Unfortunately, although it appears
to be a simple solution to the predicted probabilities
problem, the model has a number of serious
limitations which are discussed by Domencich and
McFadden. Again, the probabilities are not constrained
to sum to one.
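
A minimal sketch of the linear probability model follows, showing both the clipping rule and the failure of separately modeled classes to sum to one. The coefficients, the predictor values, and the name `linear_probability` are hypothetical, chosen only to make the behavior visible.

```python
import numpy as np

def linear_probability(y_hat):
    """Three-part rule: 0 below 0, Y itself on [0, 1], and 1 above 1."""
    return np.clip(y_hat, 0.0, 1.0)

# Hypothetical OLS coefficients (beta_0, beta_1) for two separately fitted classes.
betas = {"class_a": (-0.4, 0.02), "class_b": (0.9, -0.01)}
x = np.array([10.0, 40.0, 80.0])       # made-up values of a single predictor

p_a = linear_probability(betas["class_a"][0] + betas["class_a"][1] * x)
p_b = linear_probability(betas["class_b"][0] + betas["class_b"][1] * x)
total = p_a + p_b                      # nothing forces this to equal 1
```

For these assumed coefficients the raw predictions for the first class fall below 0 and above 1 at the extremes and are clipped, while the two class probabilities sum to values other than one at every pixel.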

The Logit Model 

The simplest, yet most statistically sound,
solution to the probability problem (within a regression
framework) is the logit transformation.
In this transformation, the ratio between the probability
that an observation or pixel $i$ belongs to
a class $\omega_1$ and the probability that it does not
belong to $\omega_1$ is expressed as a logistic function:

$$\frac{P_{1i}}{1 - P_{1i}} = e^{\beta X_i} \tag{4}$$

where $\beta$ is a vector of parameters and $X_i$ is a vector
of observations on independent variables. Taking
the natural logarithm of this expression,
$$\ln\left(\frac{P_{1i}}{1 - P_{1i}}\right) = \beta X_i \tag{5}$$

The left-hand quantity is referred to as a logit.
Note that when $\beta X_i$ is zero, the ratio will be 1,
indicating equal probability. As $\beta X_i$ varies positively
or negatively, the ratio will shift accordingly.

The ratio, under the constraint that the numerator
and denominator must sum to one, determines
the two probabilities uniquely. Expression (4)
can be solved explicitly for $P_{1i}$:

$$P_{1i} = \frac{e^{\beta X_i}}{1 + e^{\beta X_i}} \tag{6}$$

And, if $P_{2i}$ is defined as $1 - P_{1i}$, it is easy to show
that

$$P_{2i} = \frac{1}{1 + e^{\beta X_i}} \tag{7}$$

Although expressions (4) and (5) show the
product $\beta X_i$ as a linear function, the $X$ vector may
contain powers and cross products in the case of a
curvilinear model. An example is (elemental notation):

$$\ln\left(\frac{P_{1i}}{1 - P_{1i}}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i}^2 + \beta_4 X_{2i}^2 + \beta_5 X_{1i} X_{2i} \tag{8}$$

for the bivariate case.

Unlike the conventional regression models, the
logistic and linear logit regression models require
either a weighted least squares (WLS) procedure or
a maximum likelihood procedure to estimate the
calibration parameters (betas). The choice between
the two methods depends upon whether or not the
sample under investigation includes repeated observations
for each combination of values of the
explanatory variables. If so, WLS is appropriate;
however, remote sensing applications rarely have
repeated observations. Consequently, the method
of maximum likelihood is the preferred method.
A number of authors present the details of this
method, which is discussed briefly in a following
section.

Maximum likelihood estimation of logit model
parameters has many other attractive features.
Provided that the sample data are not multicollinear,
a unique maximum likelihood estimator can
be obtained even in relatively small samples.
Also, the mathematical properties of the likelihood
function allow for efficient computer programs to
produce the parameter estimates. These estimates
are consistent and are the best possible estimates
in very large samples. The disadvantages of the
procedure are that it involves numerical optimization
and therefore more computation, and that it
requires more calibration data than MLC because of
the larger number of parameters which need to be
estimated.

Logit Example

The following is an example that uses continuous
and categorical explanatory variables to estimate
the logit:

$$\ln\left(\frac{P_{1i}}{1 - P_{1i}}\right) = \beta_0 + \beta_1 X_i + \beta_2 D_i \tag{9}$$

where $P_{1i}/(1 - P_{1i})$ is the ratio of the probabilities
that pixel $i$ is or is not of class 1, $X_i$ refers to a
continuously measured variable on pixel $i$ (MSS or
continuous collateral data), and $D_i$ is a dummy variable
that is equal to one if a categorical variable is
true at pixel $i$ and zero if it is not. Given the
logit ratio, simple algebra will extract the value
of $P_{1i}$. It is straightforward to add more
continuous and categorical explanatory variables.
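
To make the role of the dummy variable concrete, the sketch below evaluates a logit of this form (one continuous variable plus one dummy variable) for a single pixel with the categorical condition false and then true. The calibration constants `B0`, `B1`, `B2` and the pixel value are hypothetical, not values from this study.

```python
import math

def prob_class1(x, d, b0, b1, b2):
    """Probability of class 1 when ln(P / (1 - P)) = b0 + b1*x + b2*d."""
    z = b0 + b1 * x + b2 * d
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical calibration constants, for illustration only.
B0, B1, B2 = -3.0, 0.05, 1.2

p_off = prob_class1(x=60.0, d=0, b0=B0, b1=B1, b2=B2)   # categorical condition false
p_on = prob_class1(x=60.0, d=1, b0=B0, b1=B1, b2=B2)    # categorical condition true
odds_ratio = (p_on / (1.0 - p_on)) / (p_off / (1.0 - p_off))
```

Switching the dummy variable on multiplies the probability ratio by `exp(B2)` regardless of the value of the continuous variable, which is how a categorical map layer shifts the prediction in a linear logit.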


Estimating the Regression Parameters 


In order to calculate the logit ratio in the
preceding formula, it is necessary to obtain estimates
of the regression parameters. The first step
is to specify the model in terms of a likelihood
function. If the training observations are thought
of as independent trials, then the likelihood of
the outcome of these trials (for the two-class case
described above) is:

$$L = \prod_{i=1}^{n_1} P_{1i} \prod_{i=n_1+1}^{n} (1 - P_{1i}) \tag{10}$$

where observations 1 to $n_1$ are those in which the
observed dependent variable was a member of class 1
and $n_1 + 1$ to $n$ are the observations in which the
dependent variable was not a member of class 1.
Substituting from the definitions of $P_{1i}$ and $1 - P_{1i}$
in expressions (6) and (7), the result is

$$L = \prod_{i=1}^{n_1} \frac{e^{\beta X_i}}{1 + e^{\beta X_i}} \prod_{i=n_1+1}^{n} \frac{1}{1 + e^{\beta X_i}} \tag{11}$$

As specified, the likelihood depends upon a
set of unknown parameters, the betas. These parameters
are estimated by choosing those values which
maximize the preceding likelihood formula for the
given set of training data. Rather than maximize
the likelihood itself, it is computationally simpler
to maximize the logarithm of the likelihood, or

$$\ln L = \sum_{i=1}^{n_1} \beta X_i - \sum_{i=1}^{n} \ln\left(1 + e^{\beta X_i}\right) \tag{12}$$

To maximize this expression it is not possible to
set the partial derivatives of $\ln L$ with respect
to the betas to zero, and solve simultaneously for
the betas in a direct fashion. Instead, the solution
must be obtained by iteratively recalculating
$\ln L$ for successive estimates of betas until the
partial derivatives converge upon zero.


There are several mathematical techniques for
iteratively converging the first partial derivatives
to zero. One well-known technique is the
Newton-Raphson Method, which calculates deltas
(amount of change) for the betas by forming the
matrix of second partial derivatives, inverting it,
and postmultiplying it by the vector of the first
partial derivatives. That is,

$$\delta = S^{-1} g$$

where $\delta$ is the vector of calculated deltas, $S^{-1}$ is
the inverted second partial derivative matrix, and
$g$ is the vector of first partial derivatives. The
deltas are subtracted from the betas, and the second
partial derivative matrix and first partial
derivatives are recalculated with the new betas.
The process continues until the vector of first
partial derivatives has converged upon zero, at
which point the most likely vector of betas has
been identified.


Given these maximum likelihood estimates of 
the betas, the last step in a logit model classifi- 
cation sequence is to use expressions (6) and (7) 
to calculate the vector of probabilities for each 
pixel and award the classification to the category 
with the largest predicted probability. 
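
The estimation and classification sequence described above can be sketched with NumPy as follows. This is an illustrative sketch rather than the FORTRAN program developed for this work: the function names, the convergence tolerance, and the synthetic one-variable training data are all assumptions made for the example.

```python
import numpy as np

def fit_logit_newton(X, y, max_iter=25, tol=1e-8):
    """Maximum likelihood betas for a two-class logit by Newton-Raphson.

    X: (n, m) array whose first column is 1 for the intercept.
    y: (n,) array, 1 where the observation belongs to class 1, else 0.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # predicted probability of class 1
        grad = X.T @ (y - p)                      # first partial derivatives of ln L
        hess = -(X.T * (p * (1.0 - p))) @ X       # second partial derivatives of ln L
        delta = np.linalg.solve(hess, grad)       # inverse second-derivative matrix times gradient
        beta = beta - delta                       # deltas are subtracted from the betas
        if np.max(np.abs(grad)) < tol:
            break
    return beta

def classify(X, beta):
    """Award class 1 where its predicted probability is the larger, else class 2."""
    p1 = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.where(p1 > 0.5, 1, 2), p1

# Synthetic training data (illustrative only): one predictor, overlapping classes.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(40, 8, 200), rng.normal(55, 8, 200)])
y = np.concatenate([np.zeros(200), np.ones(200)])
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_logit_newton(X, y)
labels, p1 = classify(X, beta_hat)
score = X.T @ (y - p1)                            # first derivatives at the solution
```

At convergence the entries of `score` are numerically zero, which is the stopping condition described in the text; the sketch solves the linear system directly instead of explicitly inverting the second-derivative matrix, a substitution made here for numerical convenience.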

Polychotomous Logit Regression


The preceding example, although conceptually
straightforward, is not applicable when there are
more than two categories to be predicted. The
dichotomous logit can be easily extended to the
polychotomous logit. Now, instead of two categories
of interest, there are $k$ possible categories.
The model now becomes:

$$P_{ji} = \frac{e^{\beta_j X_i}}{\sum_{m=1}^{k} e^{\beta_m X_i}} \tag{13}$$

where $P_{ji}$ is the probability that pixel $i$ belongs to
the $j$th category.

Because of the constraint that the probabilities
must sum to one, only $k - 1$ sets of betas and
probability ratios need to be determined. Introducing
this constraint on the $k$th class, it is
easy to show algebraically that

$$P_{ji} = \frac{e^{\beta_j X_i}}{1 + \sum_{m=1}^{k-1} e^{\beta_m X_i}} \tag{14}$$

This constraint can also be introduced by taking
$\beta_k = 0$, which produces an identical expression from
substitution into expression (13).

For estimates of the betas, the maximum likelihood
estimation procedure, described above for
the two-class case, is generalized to the $k$-class
case in a straightforward manner.
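
A minimal sketch of the polychotomous case follows, in which the last class serves as the reference category with its betas fixed at zero so that the predicted probabilities sum to one. The function name, the three-class setup, and the beta values are hypothetical.

```python
import numpy as np

def polychotomous_probabilities(x, betas):
    """Predicted probabilities for all classes at one pixel.

    x: (m,) vector of independent variables (leading 1 for an intercept).
    betas: (number of classes - 1, m) array, one row per non-reference class;
           the final (reference) class has its betas fixed at zero.
    """
    scores = np.append(betas @ x, 0.0)   # reference class contributes exp(0) = 1
    scores -= scores.max()               # shift for numerical stability; ratios are unchanged
    expo = np.exp(scores)
    return expo / expo.sum()

# Hypothetical betas for three classes with an intercept and one predictor.
betas = np.array([[-2.0, 0.05],
                  [1.0, -0.02]])
x = np.array([1.0, 30.0])

probs = polychotomous_probabilities(x, betas)
label = int(np.argmax(probs))            # category with the largest predicted probability
```

Because every probability shares the same denominator, the ratio of any class probability to that of the reference class is simply the exponential of that class's linear score.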

Proportion Estimation 

Although the discussion above has stressed the
use of the logit model for classification, it may
also be used for proportion estimation. Nalepka
et al. and Woodcock et al. have both discussed
this problem using underlying assumptions of multivariate
normality. The logit model provides an
alternative which does not assume multivariate normality
and estimates proportions directly. As in
classification, either linear or curvilinear models
may be selected, and categorical variables may be
readily utilized as well. The difference between
application of the logit model as a classifier and
as a proportion estimator lies in the nature of the
calibration data. For the classifier, training
observations of $X$ vectors (for pixel $i$) are each
individually labeled with a single class; in the
case of proportion estimation, each observation
contains the observed proportions of classes and
the associated measurement vectors. These proportions
constitute weights, and it is easy to show
that the likelihood function becomes (as in the
two-class case):

$$\ln(L) = w \sum_{i=1}^{n} f_{1i}\,(\beta_0 + \beta_1 X_i + \beta_2 D_i) - w \sum_{i=1}^{n} \ln\left(1 + e^{\beta_0 + \beta_1 X_i + \beta_2 D_i}\right)$$

where $f_{ji}$ is the proportion for the $i$th observation
for class $j$ and $w$ is the observation weight
(which, as a constant, is eliminated in differentiation
of the function).

Figure 1. Index map showing location of area
modeled in Klamath National Forest.




Application Example

At the present time, the logit-based classifier has not been
tested, although a logit model has been used in a forestry remote
sensing problem of proportion estimation. This use is summarized
in the paragraphs below. For this application, a linear logit
model was devised and fitted to forest species compositional data
for northern California, predicting the proportion of timber
volume for each of five coniferous tree species at each pixel
based on registered terrain data quantifying elevation, slope, and
aspect. This research utilizes the Video Image Communication and
Retrieval (VICAR) system and the Image Based Information System
(IBIS) resident at the University of California, Santa Barbara.
VICAR/IBIS, developed at the Jet Propulsion Laboratory (JPL) at
Pasadena, California, is a job control language which permits the
sequential linking and execution of a vast array of Fortran and
Assembler routines in a batch environment. In addition to
extensive usage of existing VICAR/IBIS routines, new VICAR and
non-VICAR software was developed and/or modified as required for
this application.

Logit modeling of species proportions used data derived from the
Klamath National Forest, located in northern California (Figure
1). Ranging in relief from 500 to 8,000 feet, the Forest includes
2,600 square miles of rugged terrain in the Siskiyou, Scott Bar,
and Salmon Mountains. Little of the area is developed beyond
management for timber yield, livestock production, and recreation.
A wide variety of distinctive vegetative types is present in the
area. Forest vegetation includes such coniferous species as noble,
red, white, and Douglas firs, ponderosa pine, and incense cedar,
as well as several oaks, and typical species of chaparral. Thus,
the topographic and vegetational characteristics of the area are
well differentiated. Within the Klamath National Forest, a study
area including most of the Goosenest Range was selected for logit
modeling of species composition from terrain features. This area
was chosen because calibration data and Landsat images were
readily available for it.

Digital Terrain Model 

The logit model for this forestry application requires preparation
of digital terrain data. These data, obtained from the National
Cartographic Information Center in Reston, Virginia, are derived
from processing of 1:250,000 contour maps, and include elevations
at every point on a grid of approximately 65 m spacing. Although
the data are comparable in scale to a Landsat image, the elevation
values are quite generalized because they are produced from small
scale contour maps by interpolation.

Slope angle and slope aspect channels can be produced using the
elevation data of the registered terrain image. Although a number
of slope and aspect generating algorithms are known, the simplest
is the fitting of a least squares plane through each pixel and its
four nearest neighbors and the calculation of the downslope angle
and direction of the plane. Slope aspect was transformed from a
coding of zero to 255 representing 0° to 359° to a cosine function
shifted by 45°. This function, proposed by Hartung and Lloyd,
contrasts northeast-facing slopes, which present a favorable cool,
moist growing environment, with hot, dry southwest-facing slopes.
Although the function is defined ecologically, it also simulates
Lambertian reflectance from a light source placed in the
northeast, and thus the aspect image shown in Figure 2 gives a
strong visual impression of relief.
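The slope and aspect computation described above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only, not the VICAR/IBIS routine actually used: the function names, the assumption that image rows run from north to south, and the use of central differences (which coincide with the least squares plane gradients for a pixel and its four nearest neighbors on a regular grid) are all choices made here for illustration.

```python
import numpy as np


def slope_aspect(elevation, spacing):
    """Slope angle and downslope direction for interior pixels.

    Assumes rows run north to south and columns west to east (an
    assumption of this sketch). Returns slope in degrees and aspect in
    degrees clockwise from north. Aspect is arbitrary on flat ground.
    """
    z = np.asarray(elevation, dtype=float)
    # For a pixel and its four nearest neighbors, the least squares
    # plane has gradients equal to the central differences below.
    dz_east = (z[1:-1, 2:] - z[1:-1, :-2]) / (2.0 * spacing)
    dz_north = (z[:-2, 1:-1] - z[2:, 1:-1]) / (2.0 * spacing)
    slope = np.degrees(np.arctan(np.hypot(dz_east, dz_north)))
    # The downslope direction points opposite to the gradient.
    aspect = np.degrees(np.arctan2(-dz_east, -dz_north)) % 360.0
    return slope, aspect


def aspect_cosine(aspect_deg, shift_deg=45.0):
    """Cosine of aspect shifted by 45 degrees: +1 facing NE, -1 facing SW."""
    return np.cos(np.radians(np.asarray(aspect_deg, dtype=float) - shift_deg))
```

With this transform a northeast-facing slope (aspect 45°) maps to +1 and a southwest-facing slope (aspect 225°) maps to -1, which is the contrast the text describes.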


Logit Model 


The logit model fitted is

$$\ln\!\left(\frac{p_k}{\bar{p}_k}\right) = \beta_{1,k} + \beta_{2,k}\, z + \beta_{3,k}\, A + \beta_{4,k}\, \alpha, \qquad k = 1, \ldots, 5,$$

where $p_k$ is the probability that a board-foot of timber volume
will be drawn from species $k$ (one of five species), $\bar{p}_k$
is the probability that the board-foot will not be drawn from
species $k$, $z$ is elevation (compressed to 0-255 range), $A$ is
aspect transformed as described above, $\alpha$ is slope angle,
and $\beta_{1,k}, \ldots, \beta_{4,k}$ are the estimated
regression constants. Note that five equations, one for each
species, actually comprise the model. The model was calibrated
using 73 measurements of timber volume prepared by the U.S. Forest
Service and located within two subregions of the Goosenest Range.
These samples are probably not representative of the entire area
modeled, but serve for the demonstration purposes of this
research.

Each sample was located on 1:15,840 scale color air photos and
transferred to Band 5 of a registered Landsat image to obtain the
line and sample coordinates of the sample point. The coordinates
were then used to extract elevation and aspect values for the
sample from the registered elevation and aspect images. The
coefficients for the model were fitted by a nonlinear optimization
algorithm employing the Newton-Raphson method described above.
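A minimal sketch of a Newton-Raphson fit for one species equation is given below. It is not the optimization code used in this research; it assumes a two-state logit (species $k$ versus all others) calibrated on observed proportions, and the function name, argument names, and stopping rule are hypothetical.

```python
import numpy as np


def fit_logit_newton(X, p, n_iter=25, tol=1e-10):
    """Fit ln(p / (1 - p)) = X @ beta by Newton-Raphson.

    X : (n, q) design matrix whose first column is all ones.
    p : (n,) observed proportions in [0, 1] (e.g. species volume shares).
    Hypothetical helper for illustration; maximizes the logit likelihood.
    """
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities
        gradient = X.T @ (p - mu)                   # first derivatives of ln L
        weights = mu * (1.0 - mu)
        information = X.T @ (X * weights[:, None])  # negative second derivatives
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Each iteration solves a weighted least squares system, so the cost per step is small for a model with four constants; the iteration can fail to converge if the calibration data are perfectly separable.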


Given the constants produced by this procedure, the probability
images were created using the new VICAR program "PROBMAPS."
PROBMAPS, written specifically for this application, calculates
the probability of species $k$ for each pixel $i$ using the
following expression:

[Expression for $P_{i,k}$, $k = 1, \ldots, 5$; illegible in the source reproduction.]

PROBMAPS then scales each probability so that the range 0-255
represents 0 to 1. PROBMAPS output images for this example are
shown in Figure 2. Brightness values represent the probabilities
of occurrence for Douglas fir, ponderosa pine, white fir, and red
fir, with probabilities scaled to range from black (0.0) to white
(1.0). The incense cedar image has been contrast stretched for
display purposes, and presents a probability range of 0.0 to 0.3
from black to white. The probability images represent maximum
likelihood estimates of species proportions; they appear
reasonable in light of the known ecological preferences of the
species, but their accuracies remain to be determined.
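The following sketch shows one way such probability images could be computed from fitted constants and registered terrain layers. It is not the PROBMAPS program. In particular, it assumes that the five species probabilities at a pixel are obtained by exponentiating each species' linear predictor and normalizing across species so that they sum to one; that normalization, the function names, and the array layout are assumptions of this sketch.

```python
import numpy as np


def probability_images(betas, elevation, aspect_cos, slope):
    """Per-pixel species probabilities from fitted logit constants.

    betas : (5, 4) array, one row per species, columns ordered as
            constant, elevation, aspect, slope (assumed layout).
    The terrain layers are 2-D arrays of identical shape.
    """
    elevation = np.asarray(elevation, dtype=float)
    layers = np.stack([
        np.ones_like(elevation),
        elevation,
        np.asarray(aspect_cos, dtype=float),
        np.asarray(slope, dtype=float),
    ])
    # Linear predictor for every species at every pixel: (5, rows, cols).
    eta = np.tensordot(np.asarray(betas, dtype=float), layers, axes=([1], [0]))
    eta = eta - eta.max(axis=0, keepdims=True)  # guard against overflow
    expo = np.exp(eta)
    return expo / expo.sum(axis=0, keepdims=True)


def to_byte(prob):
    """Scale probabilities in [0, 1] to the 0-255 image range."""
    return np.rint(np.asarray(prob, dtype=float) * 255.0).astype(np.uint8)
```

Subtracting the per-pixel maximum before exponentiating leaves the normalized probabilities unchanged and avoids overflow when elevations coded 0-255 are multiplied by large constants.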




Figure 2. Clockwise from upper left: cosine function of slope aspect; probability images of coniferous
forest species red fir; white fir; incense cedar; ponderosa pine; and Douglas fir; for Goosenest test area
within Klamath National Forest. For probability images, only area within Forest boundary is shown.






Future Work

Although the logit classifier appears to have some unique
advantages over conventional maximum likelihood classification,
further work will be necessary to prove its value for remote
sensing applications. Topics to be investigated include:

1. Model Specification. For what shapes of nonnormal
   distribution are linear models appropriate? Under what
   conditions are curvilinear models necessary? Could a stepwise
   procedure, analogous to polynomial curve fitting, be devised
   for model calibration? Since it is possible to obtain
   asymptotic estimates of the standard error of each beta, could
   the stepwise procedure drop individual terms from the model
   which are not significantly different from zero?

2. Accuracy. How well are probabilities predicted? Can a
   confidence limit be placed on the predicted probability?
   Monte Carlo methods may be helpful here. How does accuracy
   interact with distribution shape?

3. Further Applications. The logit model needs to be exercised
   on a real classification problem and compared with
   conventional MLC. Which is more accurate? Which consumes more
   computational resources? Do categorical variables present any
   special problems?

These questions and others will be the subject of future research
in the application of the logit model to the remote sensing
problem.

Table 1. Notation

Term              Definition

$p$               Number of measurement variables used to
                  characterize each object or observation.

$X$               A $p$-dimensional random vector.

$X_i$             Vector of measurements on $p$ variables
                  associated with the $i$th object or
                  observation; $i = 1, 2, \ldots, n$.

$P(X_i)$          Probability that a $p$-dimensional random
                  vector will take on observed values $X_i$.

$\omega_k$        Member of the $k$th set of classes $\omega$;
                  $k = 1, 2, \ldots$

$P(\omega_k)$     Probability that an observation will be a
                  member of class $\omega_k$; prior probability
                  of class $\omega_k$.

$\Phi_k(X_i)$     Probability density value associated with an
                  observation vector $X_i$ as evaluated for
                  class $k$.

$\Phi_k^*(X_i)$   Probability density value times prior
                  probability for observation vector $X_i$
                  evaluated for class $k$.

$\mu_k$           Parametric mean vector associated with the
                  $k$th class.

$\Sigma_k$        Parametric $p$ by $p$ dispersion
                  (variance-covariance) matrix associated with
                  the $k$th class.

$D_k$             $p$ by $p$ dispersion matrix associated with a
                  sample of observations belonging to the $k$th
                  class; taken as an estimator of $\Sigma_k$.

$\sum_{i=1}^{n}$  Summation sign; add together all occurrences
                  of $i$ from 1 to $n$.

$\prod_{i=1}^{n}$ Product sign; multiply together all
                  occurrences of $i$ from 1 to $n$.

$\hat{\beta}$     Estimated vector of regression parameters.

$m_k$             Mean vector associated with a sample of
                  observations belonging to the $k$th class;
                  taken as an estimator of $\mu_k$.

References 

Strahler, A.H. (1980). The use of prior probabilities in maximum
    likelihood classification of remotely sensed data. Remote
    Sensing of Environment, in press.

Chow, C.K. (1957). An optimum character recognition system using
    decision functions. IRE Trans. Electron. Computers 6,
    pp. 247-254.

Sebestyen, G. (1962). Decision-Making Processes in Pattern
    Recognition. Macmillan, New York.

Nilsson, N.J. (1965). Learning Machines - Foundations of Trainable
    Pattern-Classifying Systems. McGraw-Hill Inc., New York.

Schell, J.A. (1973), in Remote Sensing of Earth Resources, Volume I
    (F. Shahrokhi, Ed.), University of Tennessee Space Institute,
    Tullahoma, TN, pp. 374-394.

Reeves, R.G., Anson, A., and Landen, D. (1975). Manual of Remote
    Sensing. Amer. Soc. of Photogrammetry, Falls Church, VA,
    2 vols., 2144 pp.

Brooner, W.G., Haralick, R.M., and Dinstein, I. (1971). Spectral
    parameters affecting automated image interpretation using
    Bayesian probability techniques. Proc. Seventh Intl. Symp. on
    Rem. Sens. of Env., pp. 1929-1946.

Domencich, T.A. and McFadden, D. (1975). Urban travel demand: a
    behavioral analysis. Amsterdam, North-Holland.

Cox, D.R. (1970). The analysis of binary data. London, Methuen.

Mantel, N. and Brown, C. (1973). A logistic re-analysis of Ashford
    and Snowden's data on respiratory symptoms in British coal
    miners. Biometrics 29, pp. 649-65.

Wrigley, N. (1975). Analyzing multiple alternative dependent
    variables. Geographical Analysis 7, pp. 187-95.

Schmidt, P. and Strauss, R.P. (1975b). The prediction of
    occupation using multiple logit models. Int. Economic Review
    16, pp. 471-86.

Nalepka, R.F., Horwitz, A.H., Hyde, P.D., and Morgenstern, J.P.
    (1972). Classification of spatially unresolved objects. Manned
    Spacecraft Center, 4th Ann. Earth Resources Program Rev.,
    Vol. II.

Woodcock, C.E., Smith, T.R., and Strahler, A.H. (1979). A new
    model for estimating proportions of land cover within a pixel
    (abstr.). Machine Processing of Remotely Sensed Data
    Symposium.

Hartung, R.E. and Lloyd, W.J. (1969). Influence of aspect on
    forests of the Clarksville soils in Dent County, Missouri.
    J. Forestry 67, pp. 178-182.


Acknowledgements 

This research has been supported by NASA grant NSG-2377. The
authors would like to thank Joseph Scepan and Tara Torburn for the
illustrations, and Debbie Heath and Kathy Bresslin for the typing.


INCORPORATING COLLATERAL DATA IN LANDSAT
CLASSIFICATION AND MODELING PROCEDURES


A.H. Strahler, J.E. Estes, P.F. Maynard,
F.C. Marts, D.A. Stow

Department of Geography
University of California
Santa Barbara, CA, 93106, U.S.A.


ABSTRACT 

A number of existing statistical techniques can be used to merge spectral image data with
collateral information, including regression, ANOVA, MANOVA, ANCOVA, MANCOVA, discriminant
analysis, maximum likelihood classification with or without prior probabilities, contingency
table analysis, and logit modeling. The choice of an appropriate technique depends upon the
nature of input and output variables (continuous, discrete, or categorical) and the appropriate
model (parametric or nonparametric).

Logit modeling is a very versatile technique which is well adapted to remote sensing
application. The logit, which is the natural logarithm of an odds ratio for two states of an
output categorical variable, is predicted by a linear or curvilinear function of continuous or
categorical input variables. Since the logit models probabilities or proportions, it can be used
directly as a classifier or indirectly as an estimator of prior probabilities for conventional
maximum likelihood classification with prior probabilities. The logit model is nonparametric, a
feature which makes it highly desirable when used to merge disparate types of collateral data.
The disadvantages of the logit model are that more calibration (training) data are required to
fit the model, and that fitting requires an iterative nonlinear optimization procedure. A logit
model was devised and fitted to forest species compositional data for northern California,
predicting the proportion of timber volume in each of five species at each pixel based on
registered terrain data quantifying elevation and slope aspect.

Another versatile tool is maximum likelihood classification with prior probabilities. By making
prior probabilities conditional on a collateral data channel, the information contained within
the channel can be conveyed to the maximum likelihood algorithm. An example is in land use, in
which a previous classification and an externally devised transition probability matrix are used
together with new image data to produce an updated classification consistent with the observed
pattern of change. This technique has relevance for monitoring urban expansion and the impact of
forest clearing in developing nations.


1. INTRODUCTION 

Viewed in a broad context, the problem of combining image data and spatial collateral data
into a predicted output map or image is actually a problem of combining continuous and
categorical variables in a modeling framework capable of producing continuous or categorical
outputs. At the University of California, Santa Barbara, NASA-supported research (grant
NSG-2377) is currently underway to identify existing models and procedures for spatial modeling
and to apply them to selected datasets to demonstrate their applicability in a remote sensing
situation.

Collateral data, here defined as preexisting spatial information in the form of a map,
processed image, or set of observations at grid coordinate locations, can be combined with
Landsat or other remotely sensed digital imagery to enhance classification accuracy or to
construct models which predict spatial patterns of ground phenomena. Collateral information can
be either continuous or categorical in nature. Examples are elevation, slope, or aspect channels
obtained from a digital terrain model (continuous), and rock type, soil type, crop type, or land
use (categorical). Image data, with which collateral information are to be combined, are
typically continuous in nature, although values may be quantized into discrete gray tone levels
for data processing applications. Desired outputs may also be either categorical or continuous.
Any type of classification constitutes a discrete or categorical output, whereas such outputs as
percent bare ground, timber volume, soil loss, or forage cover are continuous in character.

Figure 1 presents appropriate techniques for the merging of continuous, categorical, and
mixed (both continuous and categorical) datasets to produce either continuous or categorical
outputs. Continuous output models, including regression, analysis of variance, and analysis of
covariance, are all mathematically related and based on least squares algebra. Categorical
output models are more diverse and include variance maximizing techniques such as discriminant
analysis as well as the nonparametric methods of contingency table analysis and logit modeling.
Maximum likelihood classification may be viewed as a special case of logit modeling in which the
input variables are assumed to be normally distributed. Nonparametric and maximum likelihood
techniques, because they produce probabilities of classification as an output, also have the
advantage that they can be adjusted for prior probabilities. Because many present remote sensing
applications call for categorical classification, these latter methods are probably most useful
in combining continuous image data with categorical and continuous types of collateral data.

Demonstrations of a selected set of these techniques are planned and under current
development; their current status is discussed in following sections. Several of these examples
have important implications for remote sensing in developing nations. In one application, logit
modeling is used to fit a model which describes the probability of occurrence of various forest
species given elevation and slope aspect values obtained from a registered digital terrain
model. Once obtained, these probabilities can be used as prior probabilities in a maximum
likelihood classification of a Landsat image with registered terrain data for natural vegetation
units. This technique could facilitate the accurate identification of forest types in complex
tropical upland environments.

In another example, land use classification for change detection monitoring is improved by
a contingency table-analytic technique which quantifies the probabilities of change for each
land use type during the fixed time interval. This technique has important implications for many
developing nations, especially in Central America, where urban expansion is impacting
agricultural land, and forest clearing for agriculture is impacting large natural environments.
Additional examples are being developed, focused on Landsat and collateral datasets obtained for
an area of Ventura County, California.


2. STATISTICAL TECHNIQUES 

Figure 1 identifies a set of statistical techniques which are relevant for combining
collateral data in the context of Landsat modeling and classification. The techniques can be
seen as a double level hierarchy, ranging from continuously measured independent variables in
the first column to categorically measured independent variables in the last column. In the
first row, the dependent variables are continuous in nature, and in the second row the dependent
variables are categorical. Continuous variables are measured on interval or ratio scales,
whereas categorical variables are measured at nominal or ordinal scales. Categorical variables
can be of three types:

(a) dichotomous (e.g., presence or absence, yes or no)

(b) unordered polychotomous (e.g., land uses: agriculture, urban, forest, etc.)

(c) ordered polychotomous (e.g., low runoff, medium, and high runoff)

The statistical techniques in the first row have been thoroughly documented and are commonly
used in social science (Blalock, 1972; Graybill, 1961; Morrison, 1967; Winer, 1971). However,
there has been relatively little work in the area of applying these approaches to remote
sensing. The second row of the figure describes techniques which are generally less well known,
but also include maximum likelihood classification as commonly carried out in remote sensing.
The remaining portions of this section discuss the theory and remote sensing application of the
techniques identified in Figure 1.

2.1 CONTINUOUS RESPONSE VARIABLE 

The statistical techniques of the first row of Figure 1 share one thing in common: they
are all different forms of the basic linear regression model. Consequently, the theory and
application of the different types of regression in row one will be very similar. Cell (a) will
be examined in detail, and except where explicitly specified, the analysis can be extended to
cells (b) and (c).

2.1.1 CELL (a) -- CONTINUOUS EXPLANATORY VARIABLES. The most commonly used method in this cell
is regression, which predicts a continuously measured response variable from continuously
measured explanatory variables. The model (in vector notation) can be written:

$$Y = X\beta + \varepsilon$$

where $Y$ is the vector of observed dependent variables, $\beta$ is the vector of unknown
parameters, $X$ is the matrix of observed independent variables, and $\varepsilon$ is the vector
of errors. (Table I describes the symbols used in this paper.) Typically, the vector $\beta$ is
unknown, and must be estimated from a dataset for which both response and explanatory variables
are observed. This estimation is done by the process of ordinary least squares regression (OLS).
OLS regression finds the slope and the intercept of a line running through the data which
minimizes $\sum (y - \hat{y})^2$, the sum of the squared differences between the observed $Y$
and the predicted $\hat{Y}$. The vector of predicted $\hat{Y}$ values is calculated by:

$$\hat{Y} = X\hat{\beta}$$

The regression is accomplished by defining the quantity $Q$ equal to $\sum (Y - \hat{Y})^2$,
taking the first partial derivatives of $Q$ with respect to the values in the vector $\beta$,
and setting these partials equal to zero. Because $Q$ is a convex quadratic function of $\beta$,
the point at which the partials vanish is a minimum.
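The minimization can be written out in matrix form. The short derivation below is the standard least squares result rather than anything specific to this paper, and it assumes that $X'X$ is invertible (that is, that the explanatory variables are not perfectly collinear):

$$Q(\beta) = (Y - X\beta)'(Y - X\beta)$$

$$\frac{\partial Q}{\partial \beta} = -2X'(Y - X\beta) = 0 \quad\Longrightarrow\quad X'X\hat{\beta} = X'Y \quad\Longrightarrow\quad \hat{\beta} = (X'X)^{-1}X'Y$$

The matrix of second derivatives, $2X'X$, is positive definite under the same assumption, which confirms that the stationary point is a minimum.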

The best statistic for measuring the strength of the regression is $R^2$. There are several
ways of calculating $R^2$, but the following is conceptually the simplest. The residual sum of
squares (RSS), or the total amount of variance in the model not explained by the regression, is
calculated by:

$$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

The RSS, when divided by the total corrected sum of squares in $Y$ (written TSS), gives the
proportion of unexplained variance to total variance. $R^2$, or the proportion of explained
variance, is obtained by subtracting this ratio from one:

$$R^2 = 1 - \frac{RSS}{TSS}$$

One example of how regression could be used within the context of merging Landsat and
collateral data is biomass modeling. The regression example used for cell (a) is restricted to
continuously measured independent variables. For simplicity, only one spectral channel and one
collateral data source are used. Adding other channels and other collateral data sources is a
straightforward procedure. A basic linear model could be:

$$\hat{B}_i = \hat{\beta}_0 + \hat{\beta}_1 (MSS_i) + \hat{\beta}_2 (rain_i)$$

where $\hat{B}_i$ is the predicted biomass for pixel $i$, $MSS_i$ is observed multispectral
scanner data (as single band, or multiband ratio or transform) for pixel $i$, $rain_i$ is
observed rainfall on pixel $i$, and the betas ($\hat{\beta}_0, \ldots, \hat{\beta}_2$) are the
estimated regression parameters.

Regression models are applied as a two stage process. First, the model must be calibrated
(the vector of betas estimated) by regressing the observed dependent variable on the observed
independent variables. In this example, biomass data are required from a sufficient number of
pixels to give a representative sample of the Landsat image to be modeled. The locations of
data points are then "rubber-sheeted" to a geometrically corrected Landsat image, and the linked
biomass and MSS data are directly accessed by a statistical software package, such as the
Statistical Analysis System (SAS), to calculate the betas and $R^2$. Secondly, if $R^2$ is
statistically significant (i.e., the calculated betas explain a significant portion of the total
variance in $Y$), the model is extended as a predictor to other pixels where the dependent
variable has not been observed. This, in effect, constitutes a model of biomass predicted by
surface reflectance and measured rainfall.
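The two stages can be illustrated with a short Python sketch using NumPy rather than SAS. Everything numerical here is synthetic and invented purely for illustration (the coefficients, the value ranges, and the noise level), and the helper names `fit_ols`, `predict`, and `r_squared` are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Stage 1: estimate the betas (intercept first) by least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta


def predict(beta, X):
    """Stage 2: apply the calibrated model to pixels with known X."""
    return np.column_stack([np.ones(len(X)), X]) @ beta


def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS, as defined in the text."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss


# Synthetic calibration sample (values are made up for this sketch).
rng = np.random.default_rng(0)
mss = rng.uniform(10.0, 60.0, 40)      # one spectral channel
rain = rng.uniform(200.0, 900.0, 40)   # rainfall at each sample pixel
biomass = 5.0 + 0.8 * mss + 0.02 * rain + rng.normal(0.0, 1.0, 40)

X_cal = np.column_stack([mss, rain])
beta_hat = fit_ols(X_cal, biomass)
r2 = r_squared(biomass, predict(beta_hat, X_cal))

# Extend the model to a pixel where biomass was not observed.
biomass_pred = predict(beta_hat, np.array([[30.0, 500.0]]))
```

A significance test on $R^2$ (the F ratio) is omitted here for brevity; in practice it would be checked before the model is extended to unobserved pixels.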

2.1.2 CELL (b) -- MIXED EXPLANATORY VARIABLES. This cell includes conventional regression
models which are similar to those of cell (a) but also include a mixture of continuous and
categorical explanatory variables. Such models are common in social science research, the
categorical explanatory variables often being termed "dummy" variables. It can be shown that the
more familiar statistical test Analysis of Covariance (ANCOVA) is a straightforward extension of
OLS regression with dummy variables (variables that assume values of 1 or 0 depending on the
presence or absence of the qualitative variable being measured).

As an example, the model used in cell (a) will be extended to cell (b). In this cell we
are able to include data measured categorically -- for example, soil type which can be observed
in two states, referred to as class 1 and class 2. The model now becomes:

$$\hat{B}_i = \hat{\beta}_0 + \hat{\beta}_1 (MSS_i) + \hat{\beta}_2 (rain_i) + \hat{\beta}_3 D_{i1} + \hat{\beta}_4 D_{i2}$$

where the biomass for pixel $i$ is predicted by the variables used in cell (a) in combination
with the dummy variable terms $\hat{\beta}_3 D_{i1}$ and $\hat{\beta}_4 D_{i2}$. If the observed
soil type for pixel $i$ is class 1, then the categorical beta that will be used is
$\hat{\beta}_3$, whereas if the soil type is class 2, then $\hat{\beta}_4$ will be used. This is
accomplished by defining $D_{i1}$ equal to 1 if the soil type in pixel $i$ is class 1 and 0
otherwise and by defining $D_{i2}$ equal to 1 if the soil type is class 2 and 0 otherwise.
It is a relatively simple task to expand this model to include several (polychotomous) soil types
or to include other categorical data.
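A small Python sketch of the dummy-variable coding follows. One practical point is worth noting: when the two soil classes are exhaustive, $D_{i1} + D_{i2} = 1$ for every pixel, so a separate intercept cannot be estimated alongside both dummy coefficients. The sketch therefore omits the separate intercept and lets the two dummy coefficients act as class-specific intercepts; that choice, and the function names, are assumptions of this illustration rather than the authors' procedure.

```python
import numpy as np


def ancova_design(mss, rain, soil_class):
    """Design matrix with two continuous columns and two soil dummies.

    soil_class holds 1 or 2 for each pixel. No separate column of ones
    is included because D1 + D2 already equals one for every pixel.
    """
    soil_class = np.asarray(soil_class)
    d1 = (soil_class == 1).astype(float)  # 1 if soil class 1, else 0
    d2 = (soil_class == 2).astype(float)  # 1 if soil class 2, else 0
    return np.column_stack([mss, rain, d1, d2])


def fit_ancova(mss, rain, soil_class, biomass):
    """Least squares estimates ordered as (MSS, rain, class 1, class 2)."""
    X = ancova_design(mss, rain, soil_class)
    coef, *_ = np.linalg.lstsq(X, np.asarray(biomass, dtype=float), rcond=None)
    return coef
```

Adding further soil types amounts to adding one more 0/1 column per class, under the same constraint on the intercept.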


When there is more than one interval scale dependent variable, the model is called MANCOVA,
or Multivariate Analysis of Covariance. In this model, all of the independent variables are
regressed against each of the dependent variables, with separate $R^2$s and F ratios calculated
for each dependent variable. An example of this is:

$$\hat{B}_i,\ \hat{S}_i = \hat{\beta}_0 + \hat{\beta}_1 (MSS_i) + \hat{\beta}_2 (rain_i) + \hat{\beta}_3 D_{i1} + \hat{\beta}_4 D_{i2}$$

where $\hat{B}_i$ is predicted biomass for pixel $i$, $\hat{S}_i$ is predicted soil loss for
pixel $i$, and the other variables remain as in the preceding example. MANCOVA can be seen as a
device to test more than one ANCOVA model in the same statistical analysis.


2.1.3 CELL (c) -- CATEGORICAL EXPLANATORY VARIABLES. The extended regression model examined in
cell (b) is also applicable here. When the explanatory variables are all measured on the ordinal
or nominal scales, the model is usually called Analysis of Variance (ANOVA). All that is
necessary to move from cell (b) to (c) is to exclude from the analysis all explanatory variables
that are measured on the interval or ratio level. As an example of ANOVA, the biomass model
would be written:

$$\hat{B}_i = \hat{\beta}_0 + \hat{\beta}_1 D_{i1} + \hat{\beta}_2 D_{i2} + \hat{\beta}_3 D_{i3} + \hat{\beta}_4 D_{i4}$$

Here, $\hat{\beta}_1$ and $\hat{\beta}_2$ refer to two different soil types and $\hat{\beta}_3$
and $\hat{\beta}_4$ refer to rainfall that has been categorized into two levels (high and low).
It would be possible to categorize MSS data for use in such a model, but there is considerable
information loss. Consequently, the ANOVA or MANOVA (Multivariate Analysis of Variance) model is
not likely to be as useful as the ANCOVA or MANCOVA model.


2.2 CATEGORICAL RESPONSE VARIABLE 

The first three cells have dealt with continuous dependent variables, using Landsat MSS data
and collateral variables as predictors in various statistical models of physical phenomena. The
last three cells, in which the dependent variable is categorical, open up new ways of utilizing
collateral data within the Landsat structure. With categorical data and with the appropriate
statistical techniques (see Figure 1) it is possible to:

1) model physical and social phenomena that are best represented in discrete steps -- low,
   medium, high (soil erosion, fire hazard, biomass, housing quality, municipal services,
   etc.);

2) classify the dependent variable into nominal groupings (land use, vegetation community
   type, etc.);

3) create predicted probabilities that the dependent variable will assume particular
   categories and use these probabilities to classify an image directly or use them in
   conjunction with other data (usually MSS) in a Bayesian maximum likelihood classifier.


Unlike the first row of Figure 1, the second row includes five different statistical
techniques. For purposes of modeling with Landsat under the constraints of the second row,
Maximum Likelihood Classification (MLC) with Prior Probabilities is the most important method.
"Maximum likelihood" is a statistical property of an estimator, and, used in its proper way,
implies that an estimator has the highest probability of producing the data which were used for
calibration. However, its use in remote sensing implies a particular decision rule (MLC)
possessing this property, which is discussed below. Our research has shown that the inclusion of
collateral data as prior probabilities to MLC offers a simple and effective way of combining
collateral and remotely sensed data.

The key to the use of prior probabilities is the logit regression model, which takes
collateral data (continuous or categorical) and generates predicted probabilities. These
predicted probabilities can be used as a direct classifier (i.e., the pixel is classified into
the category that has the highest probability), but the most likely usage of these predicted
probabilities is as input to the MLC decision rule, in which they serve as weights. Since the
logit regression model is probably the least familiar of the statistical techniques to remote
sensing research, and since it is applicable in every cell in the bottom row, it will be
explained in the most detail.

Discriminant Analysis is similar in its usage to maximum likelihood with prior probabilities,
but because of its computational complexity it has not been used often in the remote sensing
context. Consequently, it will be only briefly discussed. Chi-Square Analysis, which is the
traditional analysis used on contingency tables (all variables are categorical), is by
definition not compatible with interval or ratio scale remotely sensed data. It is similar to
ANOVA, except that the output is categorical.

2.2.1 CELL (d) -- CONTINUOUS EXPLANATORY VARIABLES. In the past ten years, maximum likelihood
classification has found wide application in the field of remote sensing. Based on multivariate
normal distribution theory, the MLC algorithm has been in use for applications in the social
sciences since the late 1940's. Providing a probabilistic method for recognizing similarities
between individual measurements and predefined standards, the algorithm found increasing use in
the field of pattern recognition in the following decades (Chow, 1957; Sebestyen, 1962; Nilsson,
1965). In remote sensing, the development of multispectral scanning technology to produce
layered multispectral digital images of land areas from aircraft or spacecraft provided the
opportunity to use the maximum likelihood classifier in producing thematic classification maps
of large areas for such purposes as land use/land cover determination and natural/cultivated
land inventory (Schell, 1972; Reeves et al., 1975).

Before presenting a practical example, it will be helpful to briefly review the mathematics
of the maximum likelihood decision rule. In the multivariate remote sensing application, it is
assumed that each observation $X$ (pixel) consists of a set of measurements on $p$ variables
(channels). Through some external procedure a set of observations which correspond to a class is
identified -- that is, a set of similar objects characterized by a vector of means on
measurement variables and a variance-covariance matrix describing the interrelationships among
the measurement variables which are characteristic of the class. Although the parametric mean
vector and dispersion matrix for the class remain unknown, they are estimated by the sample means
and dispersion matrix associated with the object sample.

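Multivariate normal statistical theory describes the probability that an observation X will
occur, given that it belongs to a class k, as the following function:

$$\Phi_k(X_i) = (2\pi)^{-p/2}\,|D_k|^{-1/2}\,\exp\left[-\tfrac{1}{2}\,(X_i - m_k)'\,D_k^{-1}\,(X_i - m_k)\right]$$

where $m_k$ and $D_k$ are the sample mean vector and dispersion matrix estimated for class k,
and p is the number of channels.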


As applied in a maximum likelihood decision rule, the previous expression allows the calculation
of the probability that an observation is a member of each of k classes. The observation (pixel)
is then assigned to the class for which the probability density value is greatest. Since the
log of the probability is a monotonic increasing function of the probability, the decision can
be made by comparing values for each class as calculated from the right hand side of the equation.

A simplified remote sensing classification rule using maximum likelihood (Tatsuoka, 1971;
Strahler, 1980) with k possible categorical classes and p channels of MSS input datasets is to
choose the k (class) which minimizes

$$F_{1,k}(X_i) = \ln|D_k| + (X_i - m_k)'\,D_k^{-1}\,(X_i - m_k).$$

This expression is derived from the preceding one by taking the natural logarithm, multiplying
by -2 (so that the largest density corresponds to the smallest value), and deleting
terms which are constant for all classes.
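
The decision rule just described can be illustrated with a minimal sketch. It assumes only two
channels, so that the determinant and inverse of the dispersion matrix can be written out by hand;
the class names, mean vectors, and dispersion matrices are invented for illustration and are not
taken from any data set discussed in this report.

```python
import math

def f1(x, mean, cov):
    """ln|D| + (x - m)' D^-1 (x - m) for a two-channel pixel (smaller is better)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("dispersion matrix must be positive definite")
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Inverse of a 2x2 matrix is (1/det) * [[d, -b], [-c, a]].
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.log(det) + quad

def classify(x, classes):
    """Assign the pixel to the class with the minimum discriminant value."""
    return min(classes, key=lambda name: f1(x, *classes[name]))

# Hypothetical training statistics: name -> (mean vector, dispersion matrix).
CLASSES = {
    "forest": ((20.0, 45.0), ((9.0, 3.0), (3.0, 16.0))),
    "water": ((12.0, 8.0), ((4.0, 0.0), (0.0, 4.0))),
}

label = classify((13.0, 9.0), CLASSES)
```

With more than two channels the same rule would normally be evaluated with a linear algebra
library rather than a hand-written inverse.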


Interval or ratio level collateral data can be incorporated as extra "logical" channels
within this model. One successful forestry application was achieved by the creation of a
texture channel which was synthesized from Landsat Band 5 by taking the standard deviation of
density values within a 3 by 3 moving window, scaling this value, associating it with the center
pixel of the 3 by 3 window, and returning it in image format (as a fifth channel). Values in
the texture channel describe the variation in image tone within the immediate area of each pixel.
High values are characteristic of edges and boundaries, whereas lower values describe more
uniform areas. This technique was shown to significantly increase classification accuracies for a
number of species-specific forest cover types in northern California (Strahler, 1978, 1979).
Strahler (1978) demonstrated how to input collateral information in the form of elevation data
and slope aspect (in combination with a texture channel) as separate "logical" channels,
increasing the classification accuracy by 27 percent.
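
The texture channel described above might be sketched as follows. This is an illustration
rather than the program actually used: the population form of the standard deviation, the
`scale` factor, and the choice to leave border pixels at zero are all assumptions.

```python
import math

def texture_channel(band, scale=1.0):
    """Standard deviation of each 3 by 3 neighborhood, written to the center pixel.

    `band` is a list of rows of density values. Border pixels, which lack a
    full window, are left at 0.0 (an assumption made for this sketch).
    """
    rows, cols = len(band), len(band[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [band[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            mean = sum(window) / 9.0
            variance = sum((v - mean) ** 2 for v in window) / 9.0
            out[r][c] = scale * math.sqrt(variance)
    return out

# A uniform area gives zero texture; a boundary between two tones gives a high value.
flat = texture_channel([[10, 10, 10]] * 3)
edge = texture_channel([[10, 10, 50]] * 3)
```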


2.2.1.1 Logit Regression. In extending the conventional regression models adopted in cells
(a), (b), and (c) to the problems of cell (d), difficulties are encountered. (For details
see Wrigley, 1976, p. 3-9; 1977b, p. 12-13.) First, a conventional regression model with a
categorical response variable will violate the constant error variance or homoscedasticity
assumption. While this problem does not result in biased or inconsistent parameter estimates,
it does result in a loss of efficiency and gives rise to serious problems if conventional
inferential tests are used. Secondly, a conventional regression model with a categorical response
variable may generate predictions which are seriously deficient. It can be shown that the
predicted values of the response variable in such a model are best interpreted as predicted
probabilities. The problem is that although probabilities are constrained to lie within the range
of 0 to 1, the predictions generated from such a model are unbounded and may take values from
minus infinity to plus infinity. Thus, the predictions may lie outside the meaningful range of
probability and may be inconsistent with the probability interpretation that was just presented.
The simplest yet most statistically sound solution to the probability problem (within a
regression framework) is the logit transformation, in which the probability $P_i$ is modeled as

$$P_i = \frac{e^{\beta X}}{1 + e^{\beta X}}$$

and the probability of "not i" as

$$1 - P_i = \frac{1}{1 + e^{\beta X}}$$

where $\beta X$ is the vector product of betas multiplied by row vector of X's (observed
explanatory variables). Although these two equations are nonlinear models, it is a simple matter
to rewrite them as

$$\frac{P_i}{1 - P_i} = e^{\beta X}.$$

The logit transformation is achieved by taking the natural logarithm of the preceding formula,
which yields

$$\ln\frac{P_i}{1 - P_i} = \beta X.$$

This transformation has the property of increasing from minus infinity to plus infinity as $P_i$
increases from 0 to 1. Once efficient estimators are calculated for the betas, simple algebra
will extract the value of $P_i$. The method can be generalized to k classes, in which there are
k - 1 logits of the form

$$\ln\frac{P_1}{P_k},\ \ln\frac{P_2}{P_k},\ \ldots,\ \ln\frac{P_{k-1}}{P_k}.$$

Each logit must be modeled separately, producing k - 1 sets of betas. As in the binary case
described above, algebra will extract the values of the probabilities from the k - 1 logits
predicted for an observation along with the constraint that all probabilities must sum to one.
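
The algebra that recovers probabilities from predicted logits can be sketched briefly. The
sketch assumes that each of the k - 1 logits is the log of a class probability over the
probability of one reference class (taken here to be the last class); the function name is
ours.

```python
import math

def probabilities_from_logits(logits):
    """Recover k class probabilities from k - 1 logits ln(P_j / P_ref).

    The reference class is returned last. With a single logit this reduces
    to the binary case, P = exp(L) / (1 + exp(L)).
    """
    odds = [math.exp(value) for value in logits]
    denominator = 1.0 + sum(odds)  # enforces the sum-to-one constraint
    return [o / denominator for o in odds] + [1.0 / denominator]

binary = probabilities_from_logits([0.0])
three_way = probabilities_from_logits([math.log(2.0), 0.0])
```

A logit of zero corresponds to even odds against the reference class, which is why the binary
example splits the probability equally.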

Unlike the conventional regression models of the previous cells, all of which can be
efficiently estimated by the ordinary least squares (OLS) method, the logistic and linear logit
regression models appropriate for the problems of cell (d) require either a weighted least squares
(WLS) estimation procedure or a maximum likelihood procedure. The choice between the two methods
depends upon whether the calibration data include repeated observations for each combination of
values of the explanatory variables (in the case of WLS) or not (in the case of maximum
likelihood). Since such replications are unlikely in calibration of the logit model for remotely
sensed data, maximum likelihood estimation is preferred. (Note that here the term "maximum
likelihood" refers to a parameter estimation method different from multivariate normal MLC.)

The maximum likelihood solution to the calibration of the logit model has many attractive features.
It can be shown that provided the sample data are not multicollinear, a unique maximum likelihood
estimator can be obtained even in relatively small samples. Also, the mathematical properties
of the likelihood function allow for efficient computer programs to produce the parameter
estimates, and these estimates are consistent and are the best possible estimates in very large
samples. The disadvantages of the procedure are that it involves numerical optimisation and
therefore more costly computation, and that it is a less familiar statistical technique. For a
description of the procedure used to calculate such maximum likelihood estimates, see Cox (1970,
p. 37), Mantel and Brown (1973, p. 654-5), Wrigley (1975, p. 191-3), Domencich and McFadden (1975,
p. 110-12) and Schmidt and Strauss (1975a, p. 434-5).
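
To make the numerical optimisation concrete, the following is a minimal sketch of maximum
likelihood estimation for a binary logit model with one explanatory variable and an intercept,
using Newton-Raphson iteration. It is not the procedure of any of the cited authors; the
function name, the starting values of zero, and the small data set are assumptions made for
illustration.

```python
import math

def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson estimates (b0, b1) for ln(P / (1 - P)) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log likelihood
            g1 += (y - p) * x
            h00 += w               # information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * delta = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Hypothetical calibration data: the classes overlap, so finite estimates exist.
XS = [0, 0, 1, 1, 2, 2, 3, 3]
YS = [0, 0, 0, 1, 0, 1, 1, 1]
B0, B1 = fit_logit(XS, YS)
```

If the two classes were perfectly separated by x, the likelihood would have no finite maximum
and the iteration would not settle; real data sets should be checked for this.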

There are two ways the logit model can be used within the context of remote sensing. The
first is to act directly as a nonparametric classifier (i.e., to classify a pixel into the class
having the highest predicted probability). The second is to use collateral data in a logit model
to predict prior probabilities and input them to a MLC decision rule which accepts prior
probabilities. This approach effectively combines a nonparametric logit model for collateral data
with a parametric MLC model; it will be discussed in a following section. The following example
of the logit model involves the direct classification of land use by MSS and terrain data. The
model is

$$\ln\frac{\Pi_{i1}}{1 - \Pi_{i1}} = \beta_0 + \beta_1(\mathit{band5}_i) + \ldots$$

where $\Pi_{i1}$ is the probability that pixel i is class 1 and the following terms indicate
linear combination of spectral and collateral data. If desired, the model can easily be expanded
to all four bands and all the continuous collateral variables relevant to the classification.

2.2.1.2 Discriminant Analysis. Discriminant analysis is a multivariate technique used to
produce sets of uncorrelated functions which separate observations most efficiently into
predesignated groups. A discriminant model, or classification criterion, is developed, the values
of which define groups for the observations. The individual observation is classified into one of
the previously defined groups by a measure of generalised squared distance.

This technique requires some difficult computation. Deriving the classification criteria
requires extracting eigenvectors from the nonsymmetric $W^{-1}B$ matrix, where B and W are,
respectively, the between and within groups sums of squares and crossproducts matrices. The
mathematics of this process are beyond the scope of the paper (please refer to Tatsuoka, 1971;
Cooley and Lohnes, 1971). The technique is helpful in the social sciences for identifying variables
which do the best job of separating classes, but is not usually used to process new data for
classification. Further, the technique assumes that all classes possess an identical dispersion
matrix, an assumption unlikely to characterize remotely sensed data.

2.2.2 CELL (e) -- MIXED EXPLANATORY VARIABLES. Conceptually no new problems are encountered in
moving from cell (d) to cell (e); categorical explanatory variables are included through dummy
variables. The logit model has high potential for application in remote sensing, since it is
capable of incorporating both continuous and discrete input data and of generating probabilities
either directly for classification or as a collateral data set of probabilities for input to MLC
(Strahler, 1979). In other words, interval level measured data such as rainfall, elevation,
slope, etc. (no matter what its variance-covariance) can be combined with discrete data such as
soil type, previously classified land use, census tracts, etc., in a nonparametric logit
framework, and the result will be a discrete output such as a land use map, a soil erodibility map
(divided into discrete levels) or other special use maps.

The logit model, as in the previous cell, can be easily extended to cell (e). Again, land
use could be predicted by

$$\ln\frac{\Pi_{i1}}{1 - \Pi_{i1}} = \beta_0 + \beta_1(\mathit{band5}_i) + \beta_2(\mathit{slope}_i) + \beta_3 D_{i1} + \beta_4 D_{i2}$$

where $\Pi_{i1}/(1 - \Pi_{i1})$ is the probability of land use '1' for pixel i over the
probability of all the land uses that are not '1', $\mathit{band5}_i$ is the MSS value for pixel i,
$\mathit{slope}_i$ is slope of the pixel, $D_{i1}$ is a dummy variable with a value of 1 if the
previous land use on pixel i was class 1 and 0 otherwise, and $D_{i2}$ has the value 1 if the
previous land use on pixel i was class 2 and 0 otherwise.
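
The dummy variable coding can be sketched as follows, assuming the model has an intercept, one
MSS band, slope, and two indicators for previous land use classes 1 and 2 (any other previous
class is the baseline, coded all zeros). The function names and the coefficient values are
hypothetical; real coefficients would come from calibration.

```python
import math

def dummy_code(category, levels):
    """One 0/1 indicator per listed level; an unlisted category codes as all zeros."""
    return [1 if category == level else 0 for level in levels]

def land_use_probability(band5, slope, previous_class, betas):
    """Probability of land use '1' from a mixed continuous/categorical logit model."""
    d1, d2 = dummy_code(previous_class, levels=(1, 2))
    b0, b1, b2, b3, b4 = betas
    logit = b0 + b1 * band5 + b2 * slope + b3 * d1 + b4 * d2
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients, chosen only to show the effect of the indicators.
BETAS = (-1.0, 0.02, -0.05, 2.0, -1.5)
p_was_1 = land_use_probability(40.0, 5.0, previous_class=1, betas=BETAS)
p_was_3 = land_use_probability(40.0, 5.0, previous_class=3, betas=BETAS)
```

Because the coefficient on the first indicator is positive in this example, a pixel previously
mapped as class 1 receives a higher probability than an otherwise identical pixel from the
baseline class.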

Preceding paragraphs have referred to the use of prior probabilities in modifying the outcome
of a MLC. Since the prior probabilities can be modeled to reflect both continuous and
categorical input data, and the output of MLC is a categorical classification, this technique is
appropriate to discussion of cell (e). Prior probabilities are incorporated into the
classification through the manipulation of the Law of Conditional Probability. The actual
derivation of the prior probability is beyond the scope of this paper (see Strahler, 1980). The
modified decision rule is to choose k which minimizes

$$F_{2,k}(X_i) = \ln|D_k| + (X_i - m_k)'\,D_k^{-1}\,(X_i - m_k) - 2\ln P(\omega_k)$$

where the only difference between this formula and the one presented in cell (d) is the
probability term, $-2\ln P(\omega_k)$. This form of the decision rule is usually attributed to
Tatsuoka and Tiedeman (1954; Tatsuoka, 1971).

It is important to understand how this decision rule behaves with different prior
probabilities. If the prior probability $P(\omega_k)$ is very small, then its natural logarithm
will be a large negative number; when multiplied by -2, it will become a large positive number and
thus $F_{2,k}$ for such a class will never be minimal. Therefore, setting a very small prior
probability will effectively remove a class from the output classification. Note that this effect
will occur even if the observation vector $X_i$ is coincident with class mean vector $m_k$. In
such a case, the quadratic product distance function $(X_i - m_k)'\,D_k^{-1}\,(X_i - m_k)$ goes to
zero, but the prior probability term $-2\ln P(\omega_k)$ can still be large. Thus it is entirely
possible that the observation will be classified into a different class, one for which the
distance function is quite large.

As the prior probability $P(\omega_k)$ becomes large and approaches 1, its logarithm will go to
zero and $F_{2,k}$ will approach $F_{1,k}$ for that class. Since this probability and all others
must sum to one, however, the prior probabilities of the remaining classes will be small numbers
and their values of $F_{2,k}$ will be greatly augmented. The effect will be to force
classification into the class with high probability. Therefore, the more extreme are the values of
the prior probabilities, the less important are the actual observation values $X_i$.
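
The behavior just described can be checked with a small numerical sketch. It assumes that the
spectral part of the discriminant has already been computed for each class, and simply adds the
prior term of minus two times the natural logarithm of the prior; the class names and values
are invented.

```python
import math

def f2(f1_value, prior):
    """Discriminant with prior probability: F1 - 2 ln P (smaller is better)."""
    return f1_value - 2.0 * math.log(prior)

def classify_with_priors(f1_values, priors):
    """Choose the class minimizing F1 - 2 ln P."""
    scores = {k: f2(f1_values[k], priors[k]) for k in f1_values}
    return min(scores, key=scores.get)

# Hypothetical spectral discriminant values: the pixel looks most like "urban".
F1_VALUES = {"urban": 3.0, "orchard": 9.0}

equal = classify_with_priors(F1_VALUES, {"urban": 0.5, "orchard": 0.5})
skewed = classify_with_priors(F1_VALUES, {"urban": 0.001, "orchard": 0.999})
```

With equal priors the spectral evidence decides the class; with a prior of 0.001 the penalty on
"urban" is about 13.8, which outweighs its spectral advantage of 6 and forces the pixel into
"orchard".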

For a numerical example of how prior probabilities can affect the decision of the maximum
likelihood classifier, please refer to Strahler, 1980. There are so many potential applications
of prior probabilities and the maximum likelihood decision rule that it would be
counterproductive to list them all. In general, all data that is relevant to a classification
model can now be incorporated and this process has been shown to significantly increase
classification accuracies (Strahler, 1978; Strahler, 1980).

The versatility of the prior probability techniques comes about when the priors are allowed
to vary on a pixel-by-pixel basis. The priors for a pixel may be determined by a logit or other
model, or by using a set of class-conditional prior probabilities estimated by sampling. Because
the priors are computed separately, it is possible to mix any sort of model estimating prior
probabilities with a multivariate normal MLC algorithm which is known to be well suited to most
spectral data. Thus, the technique allows easy, flexible merging of collateral data, used to
predict the priors, with continuous image data. These points are discussed at more length in
Strahler (1978).

2.2.3 CELL (f) -- CATEGORICAL EXPLANATORY VARIABLES. Data that fall into this level have been
traditionally analyzed by contingency table analysis with Chi-Square methods. But statistically
speaking, it is a simple matter to extend the logit model of cell (e) to cell (f) through the
use of dummy variables. For data that only come in nominal or ordinal levels, the logit model
offers new and important insights into the data (Wrigley, 1979; Theil, 1970; Grizzle, Starmer
and Koch, 1969; Koch et al., 1971, 1972, 1976a, 1977; Landis and Koch, 1977; Lehnen and Koch,
1974a, 1974b). As in earlier discussion, the logit model can serve directly as a nonparametric
classifier, using only categorical variables input as dummy variables. In this form, the logit
model is equivalent to a log linear model of a contingency table; such models are discussed fully
in such texts as Bishop et al. (1975).

The categorical logit model is formulated in the example below:

$$\ln\frac{\Pi_{i1}}{1 - \Pi_{i1}} = \beta_0 + \beta_1 D_{i1} + \beta_2 D_{i2}$$

where there are two output categories -- 1 and not 1 -- which are modeled by categories of soil
type as described in cell (e). The classifier simply assigns the output pixel to the class with
the higher probability.


3. APPLICATIONS 

Two statistical modeling techniques, logit modeling and maximum likelihood classification
with prior probabilities, were selected for further investigation in the context of a real
application. A linear logit model was devised and fitted to forest species compositional data for
northern California, predicting the proportion of timber volume for each of five coniferous
species at each pixel based on registered terrain data quantifying elevation and slope aspect.
Maximum likelihood classification with prior probabilities was tested in Ventura County,
California in a land use application. A previous Landsat classification and an externally devised
transition probability matrix were used together with new Landsat image data to produce an
updated classification consistent with the observed pattern of change.

Throughout this research we have utilized the Video Image Communication and Retrieval
(VICAR) system and the Image Based Information System (IBIS) resident at UCSB. VICAR/IBIS,
developed at the Jet Propulsion Laboratory (JPL) at Pasadena, California, USA, is a job control
language which permits the sequential linking and execution of a vast array of Fortran and
Assembler routines in a batch environment. In addition to extensive usage of existing VICAR/IBIS
routines, new VICAR and non-VICAR software were developed and/or modified as required for the
purposes of this research.


3.1 LOGIT MODELING OF SPECIES PROPORTIONS 




Logit modeling of species proportions used data derived from the Klamath National Forest,
located in northern California, USA (Figure 2). Ranging in relief from 500 to 0,000 feet, the
Forest includes 2,600 square miles of rugged terrain in the Siskiyou, Scott Bar, and Salmon
Mountains. Little of the area is developed beyond management for timber yield, livestock
production, and recreation. A wide variety of distinctive vegetative types is present in the area.
Forest vegetation includes such coniferous species as noble, red, white, and douglas firs,
ponderosa pine, and incense cedar, as well as several oaks, and typical species of chaparral. Thus,
the topographic and vegetational characteristics of the area are well differentiated. Within the
Klamath National Forest, a study area including most of the Goosenest Range was selected for
logit modeling of species composition from terrain features. This area was chosen because
calibration data and Landsat images were readily available for it.

The logit model devised for this forestry application requires preparation of digital
terrain data. These data, obtained from the National Cartographic Information Center, in Reston,
Virginia, USA, are derived from processing of 1:250,000 contour maps, and include elevations at
every point on a grid of approximately 65 m spacing. Although the data are comparable in scale
to a Landsat image, the elevation values are quite generalised because they are produced from
small scale contour maps by interpolation (Figure 4).

Slope angle and slope aspect channels can be produced using the elevation data of the
registered terrain image. Although a number of slope and aspect generating algorithms are known,
the simplest is the fitting of a least squares plane through each pixel and its four nearest
neighbors and the calculation of the downslope angle and direction of the plane. Slope angle is
obtained relative to the numeric range of the elevation channel and image grid spacing, and
azimuth is determined with respect to the rectangular image grid. These channels are best generated
directly during the preprocessing of the original terrain data. At that time, slope angles and
aspects can be calculated from half-word absolute elevations arrayed in a north-south east-west
grid.
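
The least squares plane calculation can be sketched as follows. For a pixel and its four
nearest neighbors on a square grid, the fitted plane's east-west and north-south gradients
reduce to central differences, and the center elevation affects only the intercept. The
sketch assumes that image rows run east-west with north at the top and that elevations and
grid spacing are in the same units; the function name is ours, and this is not the VICAR
routine itself.

```python
import math

def slope_aspect(z_north, z_south, z_east, z_west, spacing):
    """Slope angle (degrees) and downslope azimuth (degrees clockwise from north).

    Gradients are those of the least squares plane through a pixel and its
    four nearest neighbors. Returns (0.0, None) for a flat pixel.
    """
    b = (z_east - z_west) / (2.0 * spacing)    # rise per unit distance toward east
    c = (z_north - z_south) / (2.0 * spacing)  # rise per unit distance toward north
    if b == 0.0 and c == 0.0:
        return 0.0, None
    slope_deg = math.degrees(math.atan(math.hypot(b, c)))
    # Downslope direction is opposite to the gradient (b, c).
    aspect_deg = math.degrees(math.atan2(-b, -c)) % 360.0
    return slope_deg, aspect_deg

# Terrain rising toward the north by 130 units over 130 units of distance:
# a 45 degree slope facing due south.
south_facing = slope_aspect(z_north=165.0, z_south=35.0, z_east=100.0, z_west=100.0, spacing=65.0)
```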


The slope aspect image (Figure 5) consisted initially of gray tone densities between 0
(black) and 255 (white) which indicated the azimuth of slope orientation, ranging clockwise from
0° to 359°. These values were then transformed according to the function below:

$$\mathit{newden} = 3.0 + 126.0\,\bigl(1.0 + \cos(0.024933275\,(\mathit{oldden} - 26.1))\bigr)$$

where oldden symbolizes the old (azimuth-keyed) gray tone pixel density value, newden symbolizes
its transformed value, and the argument of the cosine function is expressed in radians. This
function transforms density values according to an orientation proposed by Hartung and Lloyd
(1969). Since northeast slopes present the most favorable growing environment, and southwest
slopes the least favorable, with northwest and southeast slopes of neutral character, the
density tone azimuths were rescaled by a cosine function, with 3 representing due northeast and 255
representing due southwest. Neutral slopes, oriented northwest or southeast, thus received
density tones near 128. Flat pixels were coded with zeroes. The function also corrects
automatically for the 12° skew of the Landsat image.

The logit model fitted is shown below:

$$\ln\frac{P_k}{P_k'} = \beta_{1,k}E + \beta_{2,k}A + \beta_{3,k}S, \qquad k = 1, \ldots, 5$$

where $P_k$ is the probability that the board-foot of timber volume will be drawn from species k
(one of five species), $P_k'$ is the probability that the board-foot will not be drawn from
species k, E is elevation (compressed to 0-255 range), A is aspect transformed as described above,
S is slope angle, and $\beta_{1,k}$, $\beta_{2,k}$, and $\beta_{3,k}$ are the estimated regression
constants. Note that five equations, one for each species, actually comprise the model. The model
was calibrated using 73 measurements of timber volume prepared by the U.S. Forest Service and
located within two subregions of the Goosenest Range. These samples are probably not
representative of the entire area modeled, but serve for the demonstration purposes of this
research. Each sample was located on 1:15,840 scale color air photos and transferred to Band 5 of
the Landsat images to obtain the line and sample coordinates of the sample point. The coordinates
were then used to extract elevation and aspect values for the sample from the registered elevation
and aspect images. The coefficients for the model were fitted by a nonlinear optimisation
algorithm employing the Newton-Raphson method to select coefficients with maximum likelihood.



Given the constants produced by the procedure discussed above, the probability images were
created using the new VICAR program "PROBMAPS." PROBMAPS, written specifically for this
application, calculates the probability of species k for each pixel using the following expression:

$$P_k = \frac{\exp(\beta_{1,k}E + \beta_{2,k}A + \beta_{3,k}S)}{1 + \exp(\beta_{1,k}E + \beta_{2,k}A + \beta_{3,k}S)}$$

PROBMAPS then scales each probability so that the range 0-255 represents 0. to 1. PROBMAPS
output images for this example are shown in Figures 6-10. Brightness values in Figures 6-9
represent probabilities of occurrence for douglas fir, ponderosa pine, white fir, and red fir,
respectively, with probabilities scaled to range from black (0.) to white (1.0). Figure 10, incense
cedar, has been contrast stretched for display purposes, and presents a probability range of 0.
to .3 from black to white. The probability images represent maximum likelihood estimates of
species proportions; they appear reasonable in light of the known ecological preferences of the
species, but their accuracies remain to be determined.
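
The per-pixel calculation can be sketched as follows. This is an illustration of the kind of
computation described, not the PROBMAPS program itself: the function names, the coefficient
values, and the `p_max` argument used to imitate a contrast stretch are assumptions.

```python
import math

def species_probability(elevation, aspect, slope, betas):
    """Binary logit probability for one species from three terrain channels."""
    b1, b2, b3 = betas
    z = b1 * elevation + b2 * aspect + b3 * slope
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to exp(z) / (1 + exp(z))

def to_density(probability, p_max=1.0):
    """Scale a probability to an integer gray tone, 0 (black) to 255 (white).

    `p_max` is the probability displayed as white; values above it are clipped.
    """
    return max(0, min(255, round(255.0 * probability / p_max)))

# Hypothetical coefficients for one species; terrain channels are on a 0-255 range.
P = species_probability(elevation=120.0, aspect=200.0, slope=30.0,
                        betas=(-0.010, 0.004, -0.002))
TONE = to_density(P)
```

Setting `p_max` to 0.3 would reproduce the kind of stretch described for a species whose
probabilities never approach 1.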

The probability images produced by PROBMAPS can be thought of as predictions of the
proportion of the timber volume expected for each species at each pixel. This view implies a
continuous mixture of species, constantly varying in response to elevation and slope aspect. An
alternative view is that forest stands are monospecific, and that each pixel is dominated by a
particular species or forest cover type. In such a case, the modeled values are probabilities that
the pixel will be dominated by a particular species or stand type, and it is therefore
appropriate to produce a single output image indicating the type with highest probability for each
pixel. In this way, the logit model can serve as a nonparametric classifier. The probabilities can
also be viewed as prior probabilities, and input to MLC of an image using spectral data. This
procedure amounts to mixing a nonparametric model for collateral data (terrain channels) with a
parametric model for spectral data (Landsat channels).

3.2 LAND USE CLASSIFICATION USING TRANSITION PROBABILITIES 

An additional objective of our research was to apply the method of maximum likelihood
classification with prior probabilities to a land use/land cover classification (see sections
2.2.1 and 2.2.2). In this example, land use/land cover maps for two 7½-minute quadrangles in
Ventura County, California, U.S.A., were obtained from photointerpretation of high-level U-2
aircraft imagery for the years 1973 and 1976. These maps, with inherent accuracies considerably
higher than those of Landsat classifications, were used as "ground truth" to construct a matrix of
transition probabilities, showing the probability of change of classification for each land
use/land cover type to each other type in the three year interval. With a 1976 Landsat image as
input data, we planned to carry out MLC of each pixel using the 1973 U-2 derived cover class as a
collateral data channel indexing the transition probabilities appropriate to the 1973 cover class.
The resulting classification was to be compared with the 1976 U-2 derived map to evaluate the
accuracy of the technique. At present, we have used image overlay techniques to create the
transition probability matrix, but the classifications using the transition probability matrix
have not been carried out.

3.2.1. PROCEDURE. Using the Jet Propulsion Lab's (JPL) Image Based Information System (IBIS) and
a coordinate digitizer, it is possible to merge image data in digital form with other types of
geographic data. The IBIS is essentially a fine-mesh grid information system which is compatible
with the handling and storing of digital image data. By allowing a user to overlay thematic map
and digital Landsat data (or pertinent Landsat-derived data) with IBIS, it is possible to derive
the values that comprise the transition probability matrix, as well as determine the accuracy of
the thematic Landsat classification data.

Prior to IBIS processing, the two "ground truth" land use/land cover maps for the
two-quadrangle study area were prepared through photointerpretation of NASA high-altitude color
infrared imagery with additional ground checks. A coordinate digitizer board was used to convert
the maps to a series of digital coordinates. Polygons of thematic land cover categories were
captured by digitizing overlapping line segments that comprise such polygons. Polygon centroids
were also digitized and assigned an appropriate land use/land cover label for later use in
converting the polygonal data to raster (image-based) form.

The digitized line segment data were processed using IBIS as follows. A modified version
of the IBIS program POLYGEN converted the coordinate digitizer segment data into polygons in
the form of an IBIS graphics file. Following this reformatting, the program POLYREG rigidly
rotated the polygons to conform with the Landsat data for the test site. POLYSCRB was used to
convert the polygon data into raster form (fine-mesh grid). The result was an image of
rasterized polygon borders representing the edges of thematic land cover units.

The next phase in the IBIS processing involved the assignment of labels to the rasterized
polygon areas. An intermediate categorized image was automatically produced by the program
PAINT, which assigns an arbitrary but unique brightness number to all the pixels within each
rasterized polygon. This output image was combined with the polygon centroids that had also
been converted to an IBIS graphics file format (with V2POLY) and rigidly transformed to
overlay with the polygon borders (POLYREG). The program CTRMATCH was used to establish the
correspondence between centroid labels and polygon brightness numbers, and the program STRETCH was
used to reassign identical brightness values to polygons with similar labels, yielding a raster
format image corresponding directly with the original land use/land cover map.


With both land use/land cover images in digital image format, the next step is to overlay
them using POLYOVLY to analyze the change occurring between the two dates in order to estimate
the transition probability matrix. The output of POLYOVLY is a table counting the number of
pixels in each combination of classes across the two images. If $n_{ij}$ denotes the count of
pixels in class i of the first image and class j of the second, then the transition probability
$p_{ij}$ for class i to j is

$$p_{ij} = \frac{n_{ij}}{\sum_{j=1}^{m} n_{ij}}$$

where m is the number of classes. The digitized map of 1976 land use/land cover was also
used to assess the accuracy of MLC of Landsat data, again using the POLYOVLY program. Accuracies
discussed below were obtained in this fashion.
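
The estimation of the transition probability matrix from an overlay can be sketched as
follows. The sketch counts co-occurring class labels in two registered images (given here as
flat sequences of labels) and divides each count by its row total; it is not the POLYOVLY
program, and the function name and class labels are invented.

```python
from collections import Counter

def transition_matrix(first, second, classes):
    """Row-normalized counts: matrix[i][j] estimates P(class j at date 2 | class i at date 1)."""
    counts = Counter(zip(first, second))
    matrix = {}
    for i in classes:
        row_total = sum(counts[(i, j)] for j in classes)
        matrix[i] = {
            j: (counts[(i, j)] / row_total if row_total else 0.0)
            for j in classes
        }
    return matrix

# Hypothetical six-pixel maps: one agricultural pixel converts to urban.
MAP_1973 = ["ag", "ag", "ag", "ag", "urban", "urban"]
MAP_1976 = ["ag", "ag", "ag", "urban", "urban", "urban"]
T = transition_matrix(MAP_1973, MAP_1976, classes=("ag", "urban"))
```

Each row of the resulting matrix sums to one whenever the class occurs in the first image, so a
row can be used directly as a set of prior probabilities.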


With a digitized 1973 land use/land cover classification map produced from air photo
interpretation now in hand as a collateral data channel registered to the 1976 Landsat image, and
with a transition probability matrix to provide sets of prior probabilities contingent on that
collateral data channel, it should be possible to carry out maximum likelihood classification using
the transition probabilities as prior probabilities indexed by the 1973 classification. Future
work includes performing the classification, and overlaying it on the 1976 digitized map for
accuracy analysis. Initial indications are that accuracies will increase with the use of the
1973 digitized map data, demonstrating the successful use of Landsat to update existing manually
produced land use/land cover maps.
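
The planned classification step might look like the following sketch, which has not been run on
the Ventura County data. It assumes the spectral discriminant value for each class is already
available for the pixel, looks up the row of transition probabilities for the pixel's 1973
class, and subtracts twice the natural logarithm of each prior. Skipping classes whose prior
is zero is our assumption; the class names and numbers are invented.

```python
import math

def classify_with_transition_priors(f1_values, previous_class, transitions):
    """Choose the class minimizing F1 - 2 ln P, with P indexed by the previous class."""
    priors = transitions[previous_class]
    best_class, best_score = None, math.inf
    for k, f1_value in f1_values.items():
        prior = priors.get(k, 0.0)
        if prior <= 0.0:
            continue  # a zero prior rules the class out entirely
        score = f1_value - 2.0 * math.log(prior)
        if score < best_score:
            best_class, best_score = k, score
    return best_class

# Hypothetical transition rows and spectral discriminant values for one pixel.
TRANSITIONS = {
    "ag": {"ag": 0.90, "urban": 0.10, "water": 0.0},
    "urban": {"ag": 0.05, "urban": 0.95, "water": 0.0},
}
F1_PIXEL = {"ag": 5.0, "urban": 4.0, "water": 1.0}

from_ag = classify_with_transition_priors(F1_PIXEL, "ag", TRANSITIONS)
from_urban = classify_with_transition_priors(F1_PIXEL, "urban", TRANSITIONS)
```

In this example the same spectral values yield different classes depending on the 1973 label,
which is the intended effect of indexing the priors by the earlier map.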

4. CONCLUSION


Of the large number of statistical techniques which can be used to develop models combining
remotely sensed images with collateral data in a common predictive framework, two techniques
are of special interest for remote sensing: logit modeling and maximum likelihood classification
with prior probabilities. These methods allow the construction of nonparametric classification
models utilizing both image and collateral data channels as well as the mixing of parametric and
nonparametric classification models for image and collateral data respectively. Both techniques
have been successfully demonstrated in application using Landsat imagery; each has the potential
to greatly increase classification accuracy through the use of collateral data, and each should
find wide application in future research and development in remote sensing.

5. REFERENCES

Bishop, Y.M., Fienberg, S.E., and Holland, P.W. (1975), Discrete Multivariate Analysis: Theory
and Practice. MIT Press, Cambridge, Massachusetts.

Blalock, H.M. (1972), Social Statistics. McGraw-Hill Inc., New York.

Chow, C.K. (1957), An optimum character recognition system using decision functions, IRE Trans.
Electron. Computers 6: pp. 247-254.




Cox, D. R. The analysis of binary data . 1970. London. Msthuian. 


Doowcieh, T.A. an d McFadden. D. Urfcou travel d amoid: a behavioral analysis, 1975. Aoseordaa. 
North-Holland, 

Graybill, F.A. (1961), An Ihtgodicdai Unaar Statistical Medals , Vol.l, McGrw-Hill Book C©», 
Now Ydrk, 

Crizzla, J.E., Starmsr, C,F. aid Koch, C.G. Analysis of categorical data by linear models. 
Biometrics 25, 1969, pp. 489-504. 

Hartumg, R.F., Lloyd, W.J. (1969), Influanca of Aspect on Foraaea of eha Clarksville Soils In 
Dane County, Missouri., Journal of Forestry. 67, pp. 178-132. 

Koch, G.G. , lossy, P.B. and Relnfure, D.W. Linear modal analysis of categorical data with 
incomplete response vectors. Biometrics 28. 1972, pp. 663-92. 

Koch, G.G., Freeman, J.L. and Lehnen, R.G. A general methodology for the analysis of ranked
policy preference data. International Statistical Review 44, 1976a, pp. 1-28.

Koch, G.G., Landis, J.R., Freeman, J.L., Freeman, D.H. and Lehnen, R.G. A general methodology
for the analysis of experiments with repeated measurement of categorical data. Biometrics
33, 1977, pp. 133-58.

Landis, J.R. and Koch, G.G. The measurement of observer agreement for categorical data.
Biometrics 33, 1977, pp. 159-74.

Lehnen, R.G. and Koch, G.G. A general linear approach to the analysis of nonmetric data:
applications for political science. American Journal of Political Science 18, 1974a, pp.
283-313.

Lehnen, R.G. and Koch, G.G. The analysis of categorical data from repeated measurement
research designs. Political Methodology 1, 1974b, pp. 103-23.

Mantel, N. and Brown, C. A logistic re-analysis of Ashford and Snowden's data on respiratory
symptoms in British coal miners. Biometrics 29, 1973, pp. 649-65.

Morrison, D.F. (1967), Multivariate Statistical Methods. McGraw-Hill Book Co., New York.

Nilsson, N.J. (1965), Learning Machines: Foundations of Trainable Pattern-Classifying Systems,
McGraw-Hill Book Co., New York.


Reeves, R.G., Anson, A., and Landen, D. (1975), Manual of Remote Sensing, Amer. Soc. of
Photogrammetry, Falls Church, VA, 2 vols., 2144 pp.

Schell, J.A. (1973), in Remote Sensing of Earth Resources, Volume I (F. Shahrokhi, Ed.),
University of Tennessee Space Institute, Tullahoma, TN [page numbers illegible in source].

Schmidt, P. and Strauss, R.P. The prediction of occupation using multiple logit models.
International Economic Review 16, 1975b, pp. 471-86.

Sebestyen, G. (1962), Decision-Making Processes in Pattern Recognition, Macmillan, New York.

Strahler, Alan H., T.L. Logan, and N.A. Bryant (1978), Improving forest cover classification
accuracy from Landsat by incorporating topographic information: Proceedings of the Twelfth
International Symposium on Remote Sensing of the Environment, pp. 927-942.

Strahler, Alan H., T.L. Logan, and C.E. Woodcock (1979), Forest classification and inventory
system using Landsat, digital terrain, and ground sample data: Proceedings of the
Thirteenth International Symposium on Remote Sensing of the Environment, pp. 1541-1557.




Strahler, Alan H. (1980), The use of prior probabilities in maximum likelihood classification
of remotely sensed data: submitted for publication.

Tatsuoka, M.M., Multivariate analysis: Techniques for Educational and Psychological Research.
1971, John Wiley & Sons, New York.

Tatsuoka, M.M. and Tiedeman, D.V. Discriminant Analysis. Review of Educational Research, 25,
1954, pp. 402-420.

Winer, B.J. (1971), Statistical Principles in Experimental Design. 2nd ed., McGraw-Hill Book Co.,
New York.

Wrigley, N. (1975), Analyzing multiple alternative dependent variables, Geographical Analysis
7, pp. 187-95.

Wrigley, N. (1976), An introduction to the use of logit models in geography. Concepts and
techniques in modern geography, 10, Norwich: Geo Abstracts Ltd.

Wrigley, N. (1976b), Probability surface mapping: a new approach to trend surface mapping.
Transactions of the Institute of British Geographers, New Series 2, pp. 129-40.

Wrigley, N. (1979), Developments in the statistical analysis of categorical data: a review,
Progress in Human Geography 3, pp. 315-355.


Table 1. Notation

TERM                          DEFINITION

$\hat{Y}$                     Vector of estimated dependent variables, where ~ signifies a vector
                              and ^ signifies an estimator.

$\hat{\beta}, \varepsilon$    Regression notation; vector of estimated betas and vector of observed
                              error terms.

$p$                           Number of measurement variables used to characterize each object or
                              observation.

$X$                           A $p$-dimensional random vector.

$X_i$                         Vector of measurements on $p$ variables associated with the $i$th object
                              or observation; $i = 1, 2, \ldots, N$.

$\omega_k$                    Member of the $k$th set of classes $\omega$; $k = 1, 2, \ldots, K$.

$P\{\omega_k\}$               Probability that an observation will be a member of class $\omega_k$; prior
                              probability for class $\omega_k$.

$\Phi_k(X_i)$                 Probability density value associated with observation vector $X_i$ as
                              evaluated for class $k$.

$\mu_k$                       Parametric mean vector associated with the $k$th class.

$m_k$                         Mean vector associated with a sample of observations belonging to
                              the $k$th class; taken as an estimator of $\mu_k$.

$\Sigma_k$                    Parametric $p$ by $p$ dispersion (variance-covariance) matrix associated
                              with the $k$th class.

$D_k$                         $p$ by $p$ dispersion matrix associated with a sample of observations
                              belonging to the $k$th class; taken as an estimator of $\Sigma_k$.


                        INPUT CHANNELS [Independent Variable(s)]

OUTPUT CHANNEL
[Dependent Variable]    Continuous               Mixed                      Categorical

Continuous              Regression Models        Analysis of Covariance     Analysis of Variance
                          - linear               Multivariate Analysis      Multivariate Analysis
                          - curvilinear            of Covariance              of Variance

Categorical             Maximum Likelihood       Maximum Likelihood         Contingency Table
                          Classification           Classification with        Analysis
                        Logit Modeling             Prior Probabilities      Logit Modeling
                        Discriminant Analysis    Logit Modeling

Figure 1. Techniques for Combining Continuous and Categorical Data
(modified from Wrigley, 1979).



[Figure: Klamath National Forest Test Site Location]


The Use of Prior Probabilities in Maximum Likelihood 
Classification of Remotely Sensed Data 


ALAN H. STRAHLER 

University of California, Santa Barbara, California


The expected distribution of classes in a final classification map can be used to improve classification
accuracies. Prior information is incorporated through the use of prior probabilities, that is, probabilities of
occurrence of classes which are based on separate, independent knowledge concerning the area to be
classified. The use of prior probabilities in a classification system is sufficiently versatile to allow (1)
prior weighting of output classes based on their anticipated sizes; (2) the merging of continuously varying
measurements (multispectral signatures) with discrete collateral information datasets (e.g., rock type, soil
type); and (3) the construction of time-sequential classification systems in which an earlier classification
modifies the outcome of a later one.

The prior probabilities are incorporated by modifying the maximum likelihood decision rule employed in a
Bayesian-type classifier to calculate a posteriori probabilities of class membership which are based not only
on the resemblance of a pixel to the class signature, but also on the weight of the class which is estimated
for the final output classification. In the merging of discrete collateral information with continuous
spectral values into a single classification, a set of prior probabilities (weights) is estimated for each value
which the discrete collateral variable may assume (e.g., each rock type or soil type). When maximum
likelihood calculations are performed, the prior probabilities appropriate to the particular pixel are used in
classification. For time-sequential classification, the prior classification of a pixel indexes a set of
appropriate conditional probabilities reflecting either the confidence of the investigator in the prior
classification or the extent to which the prior class identified is likely to change during the time period of
interest.


Introduction 

In the past ten years, maximum likelihood classification has found wide application
in the field of remote sensing. Based on multivariate normal distribution theory, the
maximum likelihood classification algorithm has been in use for applications in the social
sciences since the late 1940s. Providing a probabilistic method for recognizing similarities
between individual measurements and predefined standards, the algorithm found increasing
use in the field of pattern recognition in the following decades (Chow, 1957; Sebestyen,
1962; Nilsson, 1965). In remote sensing, the development of multispectral scanning
technology to produce layered multispectral digital images of land areas from aircraft or
spacecraft provided the opportunity to use the maximum likelihood criterion in producing
thematic classification maps of large areas for such purposes as land use/land cover
determination and natural and cultivated land inventory (Schell, 1972; Reeves et al., 1975).

In the last decade, research on the general use of classification algorithms in remote 
sensing has centered in two areas: (1) computational improvements in evaluating max- 
imum likelihood and discriminant function decision rules; and (2) the use of various 
unsupervised clustering algorithms to extract repeated or commonly occurring measure- 
ment vectors which are characteristic of a particular multispectral scene. Computational 
improvements have included such developments as look-up table schemes (Schlien and 
Smith, 1975) to reduce repeated calculation, and hybrid classifiers (Addington, 1975) 
which use parallelepiped algorithms (Goodenough and Schlien, 1974) first, then turn to 
maximum likelihood computation to resolve ambiguities. Although important for small 
image processing systems, further computational improvements will become less and less 
cost effective as real time computational costs continue to fall through the development of 
fourth- and fifth-generation hardware computer systems. 






Unsupervised methods rely on clustering measurement vectors according to some 
set of distance, similarity, or dispersion criteria. Many clustering heuristics have been 
devised and applied in image processing. Dubes and Jain (1976) provide a review and 
comparative analysis of a number of techniques which are commonly applied in pattern
recognition. However, as Kendall (1972, p. 291) points out, clustering is a subjective 
matter to which little probabilistic theory is applicable. No clustering algorithms have as 
yet come to the fore which can incorporate prior knowledge in a formal fashion (except 
for the use of a priori starting vectors in interactive clustering) with an expected incre- 
ment in class identification accuracy produced by the use of this additional information. 
However, recent developments involving guided clustering and automated labeling of 
unsupervised clusters blur the distinction between supervised and unsupervised tech- 
niques. Future work may well produce a continuum of intergrading methods from which 
a user can select a mix appropriate to the spatial, spectral, and temporal resolution of the 
data in hand and information output desired. 

The purpose of this paper is to show how prior information about the
expected distribution of classes in a final classification map can be used in several different
models to improve classification accuracies. Prior information is incorporated through the
use of prior probabilities, that is, probabilities of occurrence of classes which are based
on separate, independent knowledge concerning the area to be classified. Used in their
simplest form, the probabilities weight the classes according to their expected distribution
in the output dataset by shifting decision space boundaries to produce larger volumes in
measurement space for classes which are expected to be large and smaller volumes for
classes expected to be small.

The incorporation of prior probabilities into the maximum likelihood decision rule 
can also provide a mechanism for merging continuously measured observations
(multispectral signatures) with discretely measured collateral variables such as rock type or soil
type. As an example, consider an area of natural vegetation underlain by two distinctive 
rock types, each of which exhibits a unique mix of vegetation classes. Two sets of prior 
probabilities can be devised, one for each rock type, and the classifier can be modified to 
use the appropriate set of prior probabilities contingent on the underlying rock type. In 
this way, the classification process can incorporate discrete collateral information into the 
decision rule through a model contingent on an external conditioning variable. The 
method can also be extended to include two or more such discrete collateral datasets; the 
number is limited only by the ability to estimate the required sets of prior probabilities. 
Thus, prior probabilities provide a powerful mechanism for merging collateral datasets 
with multispectral images for classification purposes. 

Another application of prior probabilities contingent upon a collateral dataset allows 
temporal weighting in a time-sequential classification system. As an example, consider 
distinguishing between two crop types which, through differing phenologies, can be easily 
separated early in the season but are confused later on in the growing period. Through 
the use of prior probabilities, a mid-summer classification can "look backward" to a
spring classification to resolve ambiguity. Thus, winter wheat could be separated from 
spring wheat at mid-season by its distinctive early spring signature. This use of temporal 
information provides an alternative to the calculation of transformed vegetation indexes 
(TVI’s) and comparable procedures (Richardson and Wiegand, 1977) in the identification 
of crops with multitemporal images (Rouse, et al., 1973). Such a time-sequential 
classification system could also be used to monitor land use change. In this case, a 
Markov-type predictive model is used directly to set prior probabilities based on patterns 
of change shown in an area. 






Review of Maximum Likelihood Classification

To understand the application of prior probabilities to a classification problem, we 
must first review briefly the mathematics of the maximum likelihood decision rule. For 
the multivariate case, we assume each observation X (pixel) consists of a set of measure- 
ments on p variables (channels). Through some external procedure, we identify a set of 
observations which correspond to a class, that is, a set of similar objects characterized
by a vector of means on measurement variables and a variance-covariance matrix describ- 
ing the interrelationships among the measurement variables which are characteristic of the 
class. Although the parametric mean vector and dispersion matrix for the class remain 
unknown, they are estimated by the sample means and dispersion matrix associated with 
the object sample. 

Multivariate normal statistical theory describes the probability that an observation $X_i$
will occur, given that it belongs to a class $k$, as the following function:

$$\Phi_k(X_i) = (2\pi)^{-p/2}\,|\Sigma_k|^{-1/2}\exp\left[-\tfrac{1}{2}(X_i-\mu_k)'\Sigma_k^{-1}(X_i-\mu_k)\right] \tag{1}$$

(Table 1 presents an explanation of the symbols used in this and other expressions.) The
quadratic product

$$\chi^2 = (X_i-\mu_k)'\Sigma_k^{-1}(X_i-\mu_k) \tag{2}$$

can be thought of as a squared distance function which measures the distance between the
observation and the class mean as scaled and corrected for variance and covariance of the
class. It can be shown that this expression is a $\chi^2$ variate with $p$ degrees of freedom
(Tatsuoka, 1971).

As applied in a maximum likelihood decision rule, expression (1) allows the calculation
of the probability that an observation is a member of each of $K$ classes. The individual
is then assigned to the class for which the probability value is greatest. In an operational
context, we substitute observed means, variances, and covariances and use the log
form of expression (1):

$$\ln[\Phi_k(X_i)] = -\tfrac{1}{2}p\ln(2\pi) - \tfrac{1}{2}\ln|D_k| - \tfrac{1}{2}(X_i-m_k)'D_k^{-1}(X_i-m_k). \tag{3}$$

Since the log of the probability is a monotonic increasing function of the probability, the 
decision can be made by comparing values for each class as calculated from the right hand 
side of this equation. This is the decision rule that is used in the currently distributed 
versions of LARSYS and VICAR, two image processing program systems authored 
respectively by the Laboratory for Applications of Remote Sensing at Purdue University 
and the Jet Propulsion Laboratory of California Institute of Technology at Pasadena. A 
simpler decision rule, $R_1$, can be derived from expression (3) by eliminating the constants
(Tatsuoka, 1971):

$R_1$: Choose $k$ which minimizes

$$F_{1,k}(X_i) = \ln|D_k| + (X_i-m_k)'D_k^{-1}(X_i-m_k). \tag{4}$$
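
Decision rule $R_1$ can be sketched in a few lines of Python. The sketch below is illustrative only: it handles the two-variable case with a hand-written 2 x 2 inverse, and the function names (`quadratic_product`, `f1`, `classify_r1`) are assumptions introduced here, not routines from LARSYS or VICAR. The first class uses the statistics of class 1 in the numerical example given later in this paper; the second class is hypothetical.

```python
import math

def quadratic_product(x, m, d):
    """Squared distance (x - m)' D^-1 (x - m) for a 2 x 2 dispersion matrix d."""
    det = d[0][0] * d[1][1] - d[0][1] * d[1][0]
    # Closed-form inverse of a 2 x 2 matrix.
    inv = [[d[1][1] / det, -d[0][1] / det],
           [-d[1][0] / det, d[0][0] / det]]
    dx = [x[0] - m[0], x[1] - m[1]]
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))

def f1(x, m, d):
    """F_1 = ln|D| + (x - m)' D^-1 (x - m), as in expression (4)."""
    det = d[0][0] * d[1][1] - d[0][1] * d[1][0]
    return math.log(det) + quadratic_product(x, m, d)

def classify_r1(x, classes):
    """Return the index of the class with the smallest F_1 value."""
    scores = [f1(x, m, d) for m, d in classes]
    return scores.index(min(scores))

classes = [
    ((4.0, 2.0), [[3.0, 4.0], [4.0, 6.0]]),   # class 1 of the later numerical example
    ((0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]),   # hypothetical second class
]
x = (4.0, 3.0)
score_first = f1(x, *classes[0])   # ln(2) + 3/2
chosen = classify_r1(x, classes)   # index into `classes`
```

Working with $F_{1,k}$ rather than the density itself avoids evaluating the exponential for every pixel and class, which is the practical reason for preferring the simpler rule.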

The Use of Prior Probabilities in the Decision Rule 

The maximum likelihood decision rule can be modified easily to take into account 
prior probabilities which describe how likely a class is to occur in the population of obser- 
vations as a whole. The prior probability itself is simply an estimate of the proportion of 
the objects which will fall into a particular class. These prior probabilities are sometimes 
termed “weights,” since the modified classification rule will tend to weigh more heavily 
those classes with higher prior probabilities. 

Prior probabilities are incorporated into the classification through manipulation of
the Law of Conditional Probability. To begin, we define two probabilities: $P\{\omega_k\}$, the
probability that an observation will be drawn from class $\omega_k$; and $P\{X_i\}$, the probability of
occurrence of the measurement vector $X_i$. The Law of Conditional Probability states that:

$$P\{\omega_k|X_i\} = \frac{P\{\omega_k, X_i\}}{P\{X_i\}}. \tag{5}$$


The probability on the left hand side of this expression will form the basis of a modified
decision rule, since we wish to assign the $i$th observation to that class $\omega_k$ which has the
highest probability of occurrence given the $p$-dimensional vector $X_i$ which has been
observed.

Again using the Law of Conditional Probability, we find that

$$P\{X_i|\omega_k\} = \frac{P\{\omega_k, X_i\}}{P\{\omega_k\}}. \tag{6}$$


In this expression, the left hand term describes the probability that the measurement
vector will take on the values $X_i$ given that the object measured is a member of class $\omega_k$.
This probability could be determined by sampling a population of measurement vectors
for observations known to be from class $\omega_k$; however, the distribution of such vectors is
usually assumed to be Gaussian. Note that in some cases this assumption may not hold; 
as an example, Brooner et al. (1971) showed significantly higher classification accuracies 
for crops using simulated multispectral imagery with direct estimates of these conditional 
probabilities than with probabilities calculated according to Gaussian assumptions. How- 
ever, the use of the multivariate normal approximation is widely accepted, and, in any 
case, it is only under rare circumstances that sufficient data are obtained to estimate the 
conditional probabilities directly. 

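Thus, we can assume that $P\{X_i|\omega_k\}$ is acceptably estimated by $\Phi_k(X_i)$ and rewrite
expression (6) as

$$\Phi_k(X_i) = \frac{P\{\omega_k, X_i\}}{P\{\omega_k\}}. \tag{7}$$

Rearranging, we have

$$P\{\omega_k, X_i\} = \Phi_k(X_i) \cdot P\{\omega_k\} = \Phi_k^*(X_i). \tag{8}$$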


Thus, we see that the numerator of expression (5) can be evaluated as the product of the
multivariate density function $\Phi_k(X_i)$ and the prior probability of occurrence of class $\omega_k$.

To evaluate the denominator of expression (5), we note that for all $K$ classes the
conditional probabilities must sum to 1:

$$\sum_{k=1}^{K} P\{\omega_k|X_i\} = 1 = \sum_{k=1}^{K} \frac{\Phi_k(X_i) \cdot P\{\omega_k\}}{P\{X_i\}}. \tag{9}$$

Therefore,

$$P\{X_i\} = \sum_{k=1}^{K} \Phi_k(X_i) \cdot P\{\omega_k\}. \tag{10}$$




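Substituting (8) and (10) into (5),

$$P\{\omega_k|X_i\} = \frac{\Phi_k(X_i) \cdot P\{\omega_k\}}{\sum_{k=1}^{K} \Phi_k(X_i) \cdot P\{\omega_k\}} = \frac{\Phi_k^*(X_i)}{\sum_{k=1}^{K} \Phi_k^*(X_i)}. \tag{11}$$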


The last expression, then, provides the basis for the decision rule which includes
prior probabilities. Since the denominator remains constant for all classes, the observation
is simply assigned to the class for which $\Phi_k^*(X_i)$, the product of $\Phi_k(X_i)$ and $P\{\omega_k\}$, is
a maximum. In its simplest form, this decision rule can be stated as:

$R_2$: Choose $k$ which minimizes

$$F_{2,k}(X_i) = \ln|D_k| + (X_i-m_k)'D_k^{-1}(X_i-m_k) - 2\ln P\{\omega_k\}. \tag{12}$$

This form of the decision rule is usually attributed to Tatsuoka and Tiedeman (1954;
Tatsuoka, 1971).
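
The step from maximizing $\Phi_k^*(X_i)$ to minimizing $F_{2,k}(X_i)$ follows the same reasoning used to obtain $R_1$ from expression (3). A sketch of the intermediate algebra, with the sample estimates $m_k$ and $D_k$ substituted as before:

```latex
\begin{align*}
\ln \Phi_k^*(X_i) &= \ln \Phi_k(X_i) + \ln P\{\omega_k\} \\
  &= -\tfrac{1}{2}p\ln(2\pi) - \tfrac{1}{2}\ln|D_k|
     - \tfrac{1}{2}(X_i-m_k)'D_k^{-1}(X_i-m_k) + \ln P\{\omega_k\}, \\
-2\ln \Phi_k^*(X_i) &= p\ln(2\pi) + F_{2,k}(X_i).
\end{align*}
```

Because $p\ln(2\pi)$ is the same for every class and the logarithm is monotonic, the class that maximizes $\Phi_k^*(X_i)$ is the class that minimizes $F_{2,k}(X_i)$.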

It is important to understand how this decision rule behaves with different prior
probabilities. If the prior probability $P\{\omega_k\}$ is very small, then its natural logarithm will be a
large negative number; when multiplied by $-2$, it will become a large positive number and
thus $F_{2,k}$ for such a class will never be minimal. Therefore, setting a very small prior
probability will effectively remove a class from the output classification. Note that this effect
will occur even if the observation vector $X_i$ is coincident with class mean vector $m_k$. In
such a case, the quadratic product distance function $(X_i-m_k)'D_k^{-1}(X_i-m_k)$ goes to zero,
but the prior probability term $-2\ln P\{\omega_k\}$ can still be large. Thus, it is entirely possible
that the observation will be classified into a different class, one for which the distance
function is quite large.

As the prior probability $P\{\omega_k\}$ becomes large and approaches 1, its logarithm will go
to zero and $F_{2,k}$ will approach $F_{1,k}$ for that class. Since this probability and all others must
sum to one, however, the prior probabilities of the remaining classes will be small
numbers and their values of $F_{2,k}$ will be greatly augmented. The effect will be to force
classification into the class with high probability. Therefore, the more extreme are the
values of the prior probabilities, the less important are the actual observation values $X_i$.
This point is discussed in more detail in a following section.
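
The size of the prior term can be made concrete with a short calculation. The sketch below tabulates $-2\ln P\{\omega_k\}$ for a few prior probabilities; the specific values are chosen only for illustration, and the helper name `prior_penalty` is an assumption of this sketch.

```python
import math

def prior_penalty(prior):
    """Amount added to F_1 by the prior term, -2 ln P."""
    return -2.0 * math.log(prior)

# Illustrative priors, from near-certain to very rare.
priors = [0.999, 0.5, 0.1, 0.001]
penalties = {p: prior_penalty(p) for p in priors}

for p, value in penalties.items():
    print(f"P = {p:<6} adds {value:7.3f} to F_1")
```

A class with a prior of .001 carries a penalty of roughly 13.8, so it is selected only when its $F_{1,k}$ value is lower than that of a competing class by more than the difference between the two penalties.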


Numerical Example 

A simple numerical example may clarify this modification of the maximum likelihood
decision rule. For this example, we assume two classes $\omega_1$ and $\omega_2$ in a two-dimensional
measurement space. Their means and dispersion matrices are shown below.

$$m_1 = [4\ \ 2], \quad D_1 = \begin{bmatrix} 3 & 4 \\ 4 & 6 \end{bmatrix}; \qquad m_2 = [3\ \ 3], \quad D_2 = \begin{bmatrix} 4 & 5 \\ 5 & 7 \end{bmatrix} \tag{13}$$


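The determinants and inverses of these matrices are:

$$|D_1| = 2, \quad D_1^{-1} = \begin{bmatrix} 3 & -2 \\ -2 & \tfrac{3}{2} \end{bmatrix}; \qquad |D_2| = 3, \quad D_2^{-1} = \text{[illegible in source]} \tag{14}$$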


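For this example, we wish to decide to which class the measurement vector $X = (4,3)$ belongs.
To evaluate the probability associated with $\omega_1$, we first evaluate the quadratic product

$$\chi_1^2 = (X-m_1)'D_1^{-1}(X-m_1) \tag{15}$$

$$\chi_1^2 = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & -2 \\ -2 & \tfrac{3}{2} \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \tfrac{3}{2}. \tag{16}$$

The probability density value is then

$$\Phi_1(X) = \frac{1}{2\pi} \cdot \frac{1}{\sqrt{2}} \cdot e^{-3/4} = .0532. \tag{17}$$

Similarly, for the second class,

$$\chi_2^2 = (X-m_2)'D_2^{-1}(X-m_2) = 2 \tag{18}$$

$$\Phi_2(X) = \frac{1}{2\pi} \cdot \frac{1}{\sqrt{3}} \cdot e^{-1} = .0338. \tag{19}$$

Thus, the measurement vector (4,3) has a higher probability associated with membership
in class $\omega_1$ than in class $\omega_2$, and it would be appropriate to classify the measurement into
$\omega_1$.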


This same decision can be made by using the somewhat simpler decision rule $R_1$
(expression (4)):

$$F_{1,1}(X) = \ln|D_1| + (X-m_1)'D_1^{-1}(X-m_1) \tag{20}$$

$$F_{1,1}(X) = \ln(2) + \tfrac{3}{2} = 2.193; \tag{21}$$

$$F_{1,2}(X) = \ln|D_2| + (X-m_2)'D_2^{-1}(X-m_2) \tag{22}$$

$$F_{1,2}(X) = \ln(3) + 2 = 3.098. \tag{23}$$

Here again the decision is made to classify the observation $X$ into $\omega_1$.

The foregoing calculations assume equal probability of membership in $\omega_1$ and $\omega_2$.
Removing this restriction, we take prior probabilities into account. Assume that the
following prior probabilities are observed:

$$P\{\omega_1\} = \tfrac{1}{3}, \qquad P\{\omega_2\} = \tfrac{2}{3}. \tag{24}$$

Recalling the notation from expression (11) that $\Phi_k^*(X_i)$ denotes the probability density
function adjusted for the prior probability,

$$\Phi_k^*(X_i) = \Phi_k(X_i) \cdot P\{\omega_k\}, \tag{25}$$

we calculate for the two classes

$$\Phi_1^*(X) = \Phi_1(X) \cdot P\{\omega_1\} = (.0532) \cdot \tfrac{1}{3} = .0177 \tag{26}$$

$$\Phi_2^*(X) = \Phi_2(X) \cdot P\{\omega_2\} = (.0338) \cdot \tfrac{2}{3} = .0225. \tag{27}$$

The actual conditional probabilities (expression (11)) then become

$$P\{\omega_1|X = (4,3)\} = \frac{.0177}{.0177 + .0225} = .440 \tag{28}$$

$$P\{\omega_2|X = (4,3)\} = \frac{.0225}{.0177 + .0225} = .560. \tag{29}$$

Thus, the prior probabilities modify the outcome of the decision rule, favoring class $\omega_2$
for the observation (4,3) over class $\omega_1$.

In terms of the decision rule $R_2$, we calculate

$$F_{2,1}(X) = \ln|D_1| + (X-m_1)'D_1^{-1}(X-m_1) - 2\ln P\{\omega_1\} \tag{30}$$

$$F_{2,1}(X) = F_{1,1}(X) - 2\ln P\{\omega_1\} \tag{31}$$

$$F_{2,1}(X) = 2.193 - 2\ln\left(\tfrac{1}{3}\right) = 4.390. \tag{32}$$

For the second class, the outcome is

$$F_{2,2}(X) = 3.098 - 2\ln\left(\tfrac{2}{3}\right) = 3.908. \tag{33}$$

Since the observation is classified into the class which minimizes the value of $F_{2,k}$, once
again the second class is chosen.
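
The arithmetic of this example can be checked with a short script. The sketch below takes the determinants (2 and 3) and the quadratic products (3/2 and 2) as they are reported in expressions (21) and (23), rather than recomputing them from the matrices; the variable names are assumptions of this sketch.

```python
import math

p = 2                                   # number of measurement variables
dets = {1: 2.0, 2: 3.0}                 # |D_1|, |D_2| as reported in the text
chi2 = {1: 1.5, 2: 2.0}                 # quadratic products as reported in the text
priors = {1: 1.0 / 3.0, 2: 2.0 / 3.0}   # expression (24)

# Density values from expression (1), with sample estimates substituted.
phi = {k: (2 * math.pi) ** (-p / 2) * dets[k] ** -0.5 * math.exp(-0.5 * chi2[k])
       for k in dets}

# Densities weighted by the priors, then normalized as in expression (11).
phi_star = {k: phi[k] * priors[k] for k in phi}
total = sum(phi_star.values())
posterior = {k: phi_star[k] / total for k in phi_star}

# Decision functions for rules R_1 and R_2.
f1 = {k: math.log(dets[k]) + chi2[k] for k in dets}
f2 = {k: f1[k] - 2.0 * math.log(priors[k]) for k in dets}

choice_r1 = min(f1, key=f1.get)   # class favored with equal priors
choice_r2 = min(f2, key=f2.get)   # class favored once priors are included
print(posterior, f1, f2, choice_r1, choice_r2)
```

Rule $R_1$ selects the first class and rule $R_2$ the second, in agreement with the text. Small differences in the last printed digit can arise because the text carries rounded intermediate values.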

Prior Probabilities Contingent on a Single External 
Conditioning Variable 

Having shown how to modify the decision rule to take into account a set of prior 
probabilities, it is only a small step to consider several sets of probabilities, in which an 
external information source identifies which set is to be used in the decision rule. As an 
example, consider the effect of soil type on the distribution of crops that are likely to be 
grown in an area. In such a case, a single suite of crops will characterize the entire area, 
but the expected distribution of crops from one soil type to the next could be expected to 
vary considerably. Under these circumstances, it would be possible to collect a stratified
random sample of the area to be classified, in order to quantify two sets of prior probabilities:
one for the crops on the first soil type, the other for crops on the second.

Thus, we introduce a third variable $\nu_j$, which indicates the state of the external
conditioning variable (e.g. soil type) associated with the observation. We wish, then, to find
an expression describing

$$P\{\omega_k|X_i, \nu_j\}, \tag{34}$$

the probability that an observation will be a member of the class $\omega_k$ given its vector of
observed measurements and the fact that it belongs to class $\nu_j$ of the external conditioning
variable.

In deriving an expression to find this probability, we can make the assumption that
the mean vector and dispersion matrix of the class will be the same regardless of the state
of the external conditioning variable. This assumption is discussed more fully in a later
section. The assumption implies that

$$P\{X_i|\omega_k\} = P\{X_i|\omega_k, \nu_j\}. \tag{35}$$

Expanding both sides of this relationship using the Law of Conditional Probability,

$$\frac{P\{X_i, \omega_k\}}{P\{\omega_k\}} = \frac{P\{X_i, \omega_k, \nu_j\}}{P\{\omega_k, \nu_j\}}. \tag{36}$$

Solving for the 3-way joint probability,

$$P\{X_i, \omega_k, \nu_j\} = \frac{P\{X_i, \omega_k\} \cdot P\{\omega_k, \nu_j\}}{P\{\omega_k\}}. \tag{37}$$


Substituting expression (8) into the left hand term of the right numerator,

$$P\{X_i, \omega_k, \nu_j\} = \frac{\Phi_k(X_i) \cdot P\{\omega_k\} \cdot P\{\omega_k, \nu_j\}}{P\{\omega_k\}} \tag{38}$$

$$P\{X_i, \omega_k, \nu_j\} = \Phi_k(X_i) \cdot P\{\omega_k, \nu_j\} = \Phi_k^{**}(X_i). \tag{39}$$

Expanding expression (34) according to the Law of Conditional Probability,

$$P\{\omega_k|X_i, \nu_j\} = \frac{P\{\omega_k, X_i, \nu_j\}}{P\{X_i, \nu_j\}}. \tag{40}$$

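Noting that since all classes are included, expression (40), when summed over all classes,
must equal 1,

$$\sum_{k=1}^{K} P\{\omega_k|X_i, \nu_j\} = 1 = \frac{\sum_{k=1}^{K} P\{\omega_k, X_i, \nu_j\}}{P\{X_i, \nu_j\}}. \tag{41}$$

Rearranging,

$$P\{X_i, \nu_j\} = \sum_{k=1}^{K} P\{\omega_k, X_i, \nu_j\}. \tag{42}$$

Substituting expressions (39) and (42) into (40), we have

$$P\{\omega_k|X_i, \nu_j\} = \frac{\Phi_k(X_i) \cdot P\{\omega_k, \nu_j\}}{\sum_{k=1}^{K} \Phi_k(X_i) \cdot P\{\omega_k, \nu_j\}} = \frac{\Phi_k^{**}(X_i)}{\sum_{k=1}^{K} \Phi_k^{**}(X_i)}. \tag{43}$$

This result is analogous to expression (11); note that the denominator remains constant
for all $k$, and need not actually be calculated to select the class $\omega_k$ for which $\Phi_k^{**}(X_i)$ is a
maximum.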

The application of this expression in classification requires that the joint probabilities
$P\{\omega_k, \nu_j\}$ be known. However, a simpler form using conditional probabilities directly
obtained from a stratified random sample can be obtained through the application of the
Law of Conditional Probability:

$$P\{\omega_k, \nu_j\} = P\{\omega_k|\nu_j\} \cdot P\{\nu_j\}. \tag{44}$$

Since $P\{\nu_j\}$ cancels from the numerator and denominator after substitution, we have

$$P\{\omega_k|X_i, \nu_j\} = \frac{\Phi_k(X_i) \cdot P\{\omega_k|\nu_j\}}{\sum_{k=1}^{K} \Phi_k(X_i) \cdot P\{\omega_k|\nu_j\}}. \tag{45}$$


Thus, either the joint or conditional probabilities may be used in the decision rule:

$R_3$: Choose $k$ which minimizes

$$F_{3,k}(X_i) = \ln|D_k| + (X_i-m_k)'D_k^{-1}(X_i-m_k) - 2\ln P\{\omega_k, \nu_j\}; \tag{46}$$

$R_3'$: Choose $k$ which minimizes

$$F_{3,k}'(X_i) = \ln|D_k| + (X_i-m_k)'D_k^{-1}(X_i-m_k) - 2\ln P\{\omega_k|\nu_j\}. \tag{47}$$

Numerical Example 

To illustrate this use of prior probabilities contingent on an external conditioning
variable, let us return to the two-class example discussed earlier. This time, however, let
us assume that a stratified random sample of the area to be classified produces the
estimates of probabilities shown in Table 2. The conditioning variable $\nu_j$ has two states: $\nu_1$
and $\nu_2$. Under the conditions of $\nu_1$, both classes have equal prior probabilities; under $\nu_2$,
the second class is more likely to appear, with the probability of .7 for $\omega_2$ and .3 for $\omega_1$.
Table 3 presents the calculations for this example. For $\nu_1$, $\omega_1$ would be the most likely
choice. In the case of $\nu_2$, $\omega_2$ is more probable.
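
A per-pixel application of the conditional form of the rule can be sketched as follows. The density values are those of the earlier two-class example, and the conditional priors are the ones quoted in the paragraph above; the dictionary layout and the function name `classify_conditional` are assumptions of this sketch, not the layout of Tables 2 and 3.

```python
# Density values Phi_k(X) for the observation, from the earlier example.
phi = {1: 0.0532, 2: 0.0338}

# Conditional priors P{class k | state of the conditioning variable}.
cond_priors = {
    "v1": {1: 0.5, 2: 0.5},
    "v2": {1: 0.3, 2: 0.7},
}

def classify_conditional(phi, priors_for_state):
    """Return (chosen class, posterior probabilities) for one pixel."""
    weighted = {k: phi[k] * priors_for_state[k] for k in phi}
    total = sum(weighted.values())
    posterior = {k: w / total for k, w in weighted.items()}
    return max(posterior, key=posterior.get), posterior

results = {state: classify_conditional(phi, pri)
           for state, pri in cond_priors.items()}
for state, (chosen, posterior) in results.items():
    print(state, chosen, {k: round(v, 3) for k, v in posterior.items()})
```

In an image-processing setting the state of the conditioning variable would be read per pixel from a registered map layer, so the lookup `cond_priors[state]` is the only change relative to a classifier with a single set of priors.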

Adding Additional Conditioning Variables 

Logic analogous to that of the preceding section shows that classification decisions 
may be made contingent on any number of external multistate conditioning variables. 
However, a separate set of prior probabilities must be estimated for all possible states of
conditioning variables. For example, consider classifying natural vegetation in an area
containing four distinctive rock types, six different soil types, and four unique topographic
habitats. Ninety-six sets of prior probabilities will then be required. Estimating these
probabilities by separate samples would be prohibitive for such a large number of combinations.

To alleviate this problem, it is possible to model these probabilities from a much
smaller set under the assumption of no high-level interaction. This procedure amounts to
the calculation of expected values for a multidimensional contingency table when only
certain marginal totals are known. Techniques for such modeling have been described in the
recent statistical literature, and are summarized in two current books by Bishop, Fienberg
and Holland (1975) and by Upton (1978). (Other treatments appear in Cox (1970) and
Fienberg (1977).) The discussion below is based partly on the treatment presented in pp.
57-101 of Bishop et al., and the reader is referred to these works for cases involving
modeling beyond the trivariate case presented here.

As a simple example, consider the three-way case in which a measurement is a
member of class $\omega_k$ and is also associated with two conditioning variables $\nu_j$ and $o_l$. Then
the Law of Conditional Probability states

$$P\{\omega_k|\nu_j, o_l\} = \frac{P\{\omega_k, \nu_j, o_l\}}{P\{\nu_j, o_l\}} = \frac{P\{\omega_k, \nu_j, o_l\}}{\sum_{k=1}^{K} P\{\omega_k, \nu_j, o_l\}}, \tag{48}$$

since $P\{\nu_j, o_l\} = \sum_{k=1}^{K} P\{\omega_k, \nu_j, o_l\}$. There are $K \times J \times L$ probabilities of the form
$P\{\omega_k, \nu_j, o_l\}$, and we wish to estimate these with maximum likelihood without sampling
the full set. Such an estimate is possible, assuming no three-way interaction between $\omega$, $\nu$,
and $o$, if probabilities of the forms $P\{\omega_k, \nu_j\}$, $P\{\omega_k, o_l\}$, and $P\{\nu_j, o_l\}$ are known.

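The method, first described by Deming and Stephan (1940), requires iterative fitting
of three-way probabilities $P\{\omega_k, \nu_j, o_l\}$ to conform with observed two-way probabilities
$P\{\omega_k, \nu_j\}$, $P\{\omega_k, o_l\}$, and $P\{\nu_j, o_l\}$. Beginning with an initial starting probability
$P_0\{\omega_k, \nu_j, o_l\}$, the individual three-way probabilities are first rescaled to conform with one
set of two-way probabilities:

$$P_1\{\omega_k, \nu_j, o_l\} = P_0\{\omega_k, \nu_j, o_l\} \cdot \frac{P\{\omega_k, \nu_j\}}{\sum_{l=1}^{L} P_0\{\omega_k, \nu_j, o_l\}}. \tag{49}$$

Rescaling then proceeds for another set of two-way probabilities:

$$P_2\{\omega_k, \nu_j, o_l\} = P_1\{\omega_k, \nu_j, o_l\} \cdot \frac{P\{\omega_k, o_l\}}{\sum_{j=1}^{J} P_1\{\omega_k, \nu_j, o_l\}}, \tag{50}$$

and finally for the last set:

$$P_3\{\omega_k, \nu_j, o_l\} = P_2\{\omega_k, \nu_j, o_l\} \cdot \frac{P\{\nu_j, o_l\}}{\sum_{k=1}^{K} P_2\{\omega_k, \nu_j, o_l\}}. \tag{51}$$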


As the procedure is repeated, convergence occurs rapidly, and values stabilize within
a small number of iterations (Deming and Stephan, 1940). The method always converges
toward the unique set of maximum likelihood estimates and can be used with any set of
starting values; further, estimates may be determined to any preset level of accuracy
(Bishop, Fienberg, and Holland, 1975, p. 83).
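
The three rescaling steps can be sketched as a short routine. The code below is a minimal illustration, not the program used in this study: the function names (`margins`, `fit_three_way`), the uniform starting table, and the 2 x 2 x 2 table used to generate a consistent set of two-way margins are all assumptions made for the example.

```python
def margins(p):
    """Return the three two-way margins of a K x J x L table p[k][j][l]."""
    K, J, L = len(p), len(p[0]), len(p[0][0])
    kj = [[sum(p[k][j]) for j in range(J)] for k in range(K)]
    kl = [[sum(p[k][j][l] for j in range(J)) for l in range(L)] for k in range(K)]
    jl = [[sum(p[k][j][l] for k in range(K)) for l in range(L)] for j in range(J)]
    return kj, kl, jl

def fit_three_way(p_kj, p_kl, p_jl, tol=1e-9, max_iter=500):
    """Iteratively rescale a uniform table until it matches all three margins."""
    K, J, L = len(p_kj), len(p_jl), len(p_jl[0])
    p = [[[1.0 / (K * J * L)] * L for _ in range(J)] for _ in range(K)]
    for iteration in range(1, max_iter + 1):
        before = [[row[:] for row in plane] for plane in p]
        for k in range(K):                      # match the (class, v) margin
            for j in range(J):
                s = sum(p[k][j])
                if s > 0:
                    p[k][j] = [v * p_kj[k][j] / s for v in p[k][j]]
        for k in range(K):                      # match the (class, o) margin
            for l in range(L):
                s = sum(p[k][j][l] for j in range(J))
                if s > 0:
                    for j in range(J):
                        p[k][j][l] *= p_kl[k][l] / s
        for j in range(J):                      # match the (v, o) margin
            for l in range(L):
                s = sum(p[k][j][l] for k in range(K))
                if s > 0:
                    for k in range(K):
                        p[k][j][l] *= p_jl[j][l] / s
        change = max(abs(p[k][j][l] - before[k][j][l])
                     for k in range(K) for j in range(J) for l in range(L))
        if change < tol:                        # no cell moved by more than tol
            break
    return p, iteration

# Hypothetical 2 x 2 x 2 joint table, used only to generate consistent margins.
source = [[[0.10, 0.05], [0.15, 0.10]],
          [[0.05, 0.20], [0.10, 0.25]]]
target_kj, target_kl, target_jl = margins(source)
fitted, n_iter = fit_three_way(target_kj, target_kl, target_jl)
fit_kj, fit_kl, fit_jl = margins(fitted)
```

The fitted table reproduces the three sets of two-way margins but need not equal the table that generated them, because the fit imposes the no-three-way-interaction assumption.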

In a typical remote sensing application, a stratified random sample is collected which
estimates the two conditional probability sets $P\{\omega_k|\nu_j\}$ and $P\{\omega_k|o_l\}$. In addition,
probabilities of the form $P\{\nu_j, o_l\}$ are obtained by processing registered digital images of maps
showing the spatial distributions of $\nu$ and $o$. By noting that

$$P\{\nu_j\} = \sum_{l=1}^{L} P\{\nu_j, o_l\} \tag{52}$$

and using the Law of Conditional Probability,

$$P\{\omega_k, \nu_j\} = P\{\omega_k|\nu_j\} \cdot P\{\nu_j\}, \tag{53}$$

the joint probabilities $P\{\omega_k, \nu_j\}$ and $P\{\omega_k, o_l\}$ can easily be calculated from the
conditional probabilities $P\{\omega_k|\nu_j\}$ and $P\{\omega_k|o_l\}$.
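
Expressions (52) and (53) amount to a row sum followed by an element-wise product. In the sketch below every numerical value is hypothetical and is chosen only so that the arithmetic is easy to follow; the variable names are likewise assumptions.

```python
# Hypothetical joint probabilities P{v_j, o_l} measured from registered maps.
p_vo = [[0.20, 0.10],
        [0.30, 0.40]]

# Hypothetical conditional probabilities P{class k | v_j}; each column j sums to 1.
p_class_given_v = [[0.6, 0.2],   # class 1 under v_1, v_2
                   [0.4, 0.8]]   # class 2 under v_1, v_2

# Expression (52): P{v_j} is the sum of P{v_j, o_l} over l.
p_v = [sum(row) for row in p_vo]

# Expression (53): P{class k, v_j} = P{class k | v_j} * P{v_j}.
p_class_v = [[p_class_given_v[k][j] * p_v[j] for j in range(len(p_v))]
             for k in range(len(p_class_given_v))]
```

The same two steps, applied with the column sums of the map table, give the joint probabilities for the second conditioning variable.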


Numerical Example 

A simple numerical example will illustrate tire iterative method. Table 4 presents a 
set of one-way conditional probabilities for an example of three classes A'-=l,2,3 with 
two conditioning variables vj, 1,2,3 and o /( /— 1,2,3. Although simple decimal values 
are assumed here for ease of computation, these values would normally be obtained by 
prior random stratified sampling. Also required are the joint probabilities P(i/y,o/} (Table 
5). Tables 6 and 7 show how values for P{w^,i^} and Pfw^.o/} are calculated according 
to expressions (52) and (53). 

Table 8 presents the results of the first two iterations in fitting the no-three-way-
interaction model to these data. Using the criterion of no further change in any
P{ω_k,ν_j,o_l} of greater than 10^-6, convergence is reached at iteration 23. Although these
probabilities can be used directly in decision rule R_3, it may be easier to examine the
values as conditional probabilities as used in R'_3. These values are shown in the last
column of the table.
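The iterative fitting can be sketched in a few lines of Python. This is an illustrative sketch, not the software used in the study: the names `fit_cycle`, `fit`, and `conditionals`, the uniform starting table, and the order of the three rescaling steps (chosen to agree with the first-iteration entries of Table 8) are assumptions. The inputs are the values of Tables 4 and 5.

```python
# Tables 4 and 5 (k = class, j = state of v, l = state of o).
P_W_GIVEN_V = [[.6, .5, .2], [.3, .2, .4], [.1, .3, .4]]    # P{w_k | v_j}, indexed [k][j]
P_W_GIVEN_O = [[.5, .8, .2], [.4, .1, .3], [.1, .1, .5]]    # P{w_k | o_l}, indexed [k][l]
P_VO = [[.08, .12, .14], [.07, .09, .12], [.16, .10, .12]]  # P{v_j, o_l}, indexed [j][l]
R = range(3)

# Expression (52): marginals of the joint table.
P_V = [sum(P_VO[j]) for j in R]
P_O = [sum(P_VO[j][l] for j in R) for l in R]
# Expression (53): two-way joints from the one-way conditionals.
P_WV = [[P_W_GIVEN_V[k][j] * P_V[j] for j in R] for k in R]
P_WO = [[P_W_GIVEN_O[k][l] * P_O[l] for l in R] for k in R]


def fit_cycle(p):
    """One iteration: rescale p[k][j][l] to each two-way margin in turn."""
    p1 = [[[p[k][j][l] * P_WV[k][j] / sum(p[k][j][m] for m in R)
            for l in R] for j in R] for k in R]
    p2 = [[[p1[k][j][l] * P_WO[k][l] / sum(p1[k][m][l] for m in R)
            for l in R] for j in R] for k in R]
    p3 = [[[p2[k][j][l] * P_VO[j][l] / sum(p2[m][j][l] for m in R)
            for l in R] for j in R] for k in R]
    return p1, p2, p3


def fit(tol=1e-6, max_iter=1000):
    """Iterate from a uniform table until no cell changes by more than tol."""
    p = [[[1.0 / 27 for _ in R] for _ in R] for _ in R]
    for iteration in range(1, max_iter + 1):
        new = fit_cycle(p)[2]
        change = max(abs(new[k][j][l] - p[k][j][l])
                     for k in R for j in R for l in R)
        p = new
        if change <= tol:
            break
    return p, iteration


def conditionals(p):
    """P{w_k | v_j, o_l} = P{w_k, v_j, o_l} / P{v_j, o_l}."""
    return [[[p[k][j][l] / P_VO[j][l] for l in R] for j in R] for k in R]
```

Applied to the uniform starting table, `fit_cycle` gives approximately .0680, .0753, and .0502 for the cell k = j = l = 1 after the three rescaling steps, in agreement with the first iteration of Table 8.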

Time-Sequential Classification 

If a classification carried out at an earlier time is viewed as an external conditioning
variable, then the mechanism of prior probabilities can be used to make the outcome of a 
classification contingent on the earlier classification. This application is best clarified by an 
example. Consider an agricultural classification with four field types: rice, cotton, orchard, 
and fallow. An early spring classification reveals the presence of young rice with high 
accuracy, but at that time cotton cannot be distinguished from fallow fields. Orchards are 
easily distinguished from field crops at any time of year. By early summer, many fields 
which were classified as fallow are likely to be in cotton; however, fields classified as rice are
still likely to be rice. Orchards will remain unchanged in areal extent.

Data from prior years are collected to quantify these expected changes, and a transition
probability matrix is devised which describes the changes in classification expected
during the early spring-early summer period (Table 9). In this example, spring
classification shows thirty percent of the observations to be rice; by summer, ninety percent
of these observations are expected to continue as rice, with ten percent returning to
fallow because of crop failure or lack of irrigation. Twenty percent of the spring observations
are orchards, and all of these are expected to remain in orchard through early summer.
Fallow fields, constituting fifty percent of the spring observations, are most likely to
become cotton (probability .7), with a few becoming rice, orchard, or remaining fallow
(probability .1 each). Since no observations are classified as cotton in spring, no transition
probabilities are needed for that class. This use of transition probabilities was suggested
as early as 1967 by Simonett et al.

The transition probability matrix can also be recognized as a matrix of conditional
probabilities P{ω_k|ν_j} which describe the probability that an observation will fall into summer
class ω_k given that the observation falls into spring class ν_j. Thus, the early summer
classification can be made contingent on the early spring classification through the prior
probability mechanisms discussed earlier, and any possible confusion between cotton and
rice in summer will be resolved by the spring classification. It is also interesting to note
that the transition probability matrix is actually a square stochastic matrix, and therefore
the situation is equivalent to a simple, one-step Markov process. Under these conditions,
the expected posterior probabilities P{ω_k} are

    P\{\omega_k\} = \sum_{j=1}^{J} P\{\omega_k \mid \nu_j\} \cdot P\{\nu_j\}.    (54)
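As a check on the arithmetic of the agricultural example, the expected summer class proportions can be computed from the spring proportions and the transition probabilities. The sketch below is illustrative only: the function name `expected_summer` is an assumption, the fallow-to-orchard probability is taken as .1 as stated in the text, and the cotton row is a placeholder of zeros because no spring observations fall in that class.

```python
CLASSES = ["rice", "cotton", "orchard", "fallow"]

# P{v_j}: proportions of the spring classification.
P_SPRING = [0.3, 0.0, 0.2, 0.5]

# TRANSITION[j][k] = P{w_k | v_j}: spring class j (row) to summer class k (column).
TRANSITION = [
    [0.9, 0.0, 0.0, 0.1],  # rice
    [0.0, 0.0, 0.0, 0.0],  # cotton: placeholder, never used since P{v_2} = 0
    [0.0, 0.0, 1.0, 0.0],  # orchard
    [0.1, 0.7, 0.1, 0.1],  # fallow
]


def expected_summer(p_spring, transition):
    """Sum over spring states j of P{w_k | v_j} * P{v_j}, for each summer class k."""
    n = len(p_spring)
    return [sum(transition[j][k] * p_spring[j] for j in range(n)) for k in range(n)]


for name, p in zip(CLASSES, expected_summer(P_SPRING, TRANSITION)):
    print(f"{name:8s} {p:.2f}")
```

With these inputs the expected summer proportions are .32 for rice, .35 for cotton, .25 for orchard, and .08 for fallow.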




In a recent paper, Swain (1978) has carried this approach a step further, incorporating
in the decision rule both measurement vectors X_{i,1} and X_{i,2} taken at times t_1 and
t_2 at point i in space. In contrast, the approach described above uses X_{i,1} to predict ν_j at
time t_1 and then uses X_{i,2} and ν_j to make the classification decision at time t_2. Swain's
decision rule, in the notation of this paper, becomes

R_SW: Choose k to maximize

    F_{SW}(X_{i,1},X_{i,2}) = \sum_{j=1}^{J} P\{X_{i,1} \mid \nu_j\} \cdot P\{X_{i,2} \mid \omega_k\} \cdot P\{\omega_k \mid \nu_j\} \cdot P\{\nu_j\}.    (55)


Swain has termed this rule the “cascade classifier.” 
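A minimal sketch of the cascade rule of expression (55) follows. It is not Swain's implementation: the function names and all likelihood values are hypothetical, chosen so that the spring measurement looks like fallow and the summer measurement cannot separate cotton from fallow. The transition probabilities and spring priors are those of the agricultural example.

```python
def cascade_scores(lik_t1, lik_t2, transition, p_v):
    """F for each class k: sum_j P{X1|v_j} * P{X2|w_k} * P{w_k|v_j} * P{v_j}."""
    states = range(len(p_v))
    return [
        lik_t2[k] * sum(lik_t1[j] * transition[j][k] * p_v[j] for j in states)
        for k in range(len(lik_t2))
    ]


def cascade_classify(lik_t1, lik_t2, transition, p_v):
    """Index of the class with the largest cascade score."""
    scores = cascade_scores(lik_t1, lik_t2, transition, p_v)
    return max(range(len(scores)), key=scores.__getitem__)


CLASSES = ["rice", "cotton", "orchard", "fallow"]
P_V = [0.3, 0.0, 0.2, 0.5]             # spring priors P{v_j}
TRANSITION = [                         # TRANSITION[j][k] = P{w_k | v_j}
    [0.9, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],              # cotton row unused: P{v_2} = 0
    [0.0, 0.0, 1.0, 0.0],
    [0.1, 0.7, 0.1, 0.1],
]
LIK_SPRING = [0.1, 0.1, 0.1, 0.6]      # hypothetical P{X1 | v_j}
LIK_SUMMER = [0.2, 0.5, 0.1, 0.5]      # hypothetical P{X2 | w_k}

best = cascade_classify(LIK_SPRING, LIK_SUMMER, TRANSITION, P_V)
print(CLASSES[best])
```

Although the summer likelihoods for cotton and fallow are equal in this hypothetical case, the transition probabilities tip the decision toward cotton.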

Swain's approach has the advantage of using full information about the distances of
the measurement vectors X_{i,1} and X_{i,2} from class means; however, as Swain notes, there
is no way to make the first observation set dominate the second. When the transition
probability matrix goes to an identity matrix, the classification rule becomes:

    F_{SW}(X_{i,1},X_{i,2}) = P\{X_{i,1} \mid \nu_k\} \cdot P\{X_{i,2} \mid \omega_k\},    (56)

and the two observations become equally weighted. Decision rule R_3 does allow the first
observation to dominate; here, an identity transition probability matrix will preserve the
first classification completely. On the other hand, R_3 assumes that the prior classification
is perfectly correct, and any errors in the prior classification will also be preserved to an
extent controlled by the transition probabilities. Thus, both approaches are relevant,
depending on the classification task at hand.


Remote Sensing Example 

The preceding numerical examples have demonstrated the application of prior probabilities
in maximum likelihood classification in a computational context; a real example
drawn from remote sensing should further understanding in an operational
context. This example (Strahler, Logan, and Bryant, 1978) is drawn from a problem
involving classification of natural vegetation in a heavily forested area of northern California.
In the classification, spectral data are used to define species-specific timber types, and
elevation and slope aspect are used as collateral data channels to improve classification
accuracy.

The area selected for application of the classification techniques described above is 
referred to as the Doggett Creek study area, comprising about 220 sq. km. of private and 
publicly-owned forest land in northern California near the town of Klamath River. 

Located within the Siskiyou Mountains, elevations in the area range from 500 m at the 
Klamath River, which crosses the southern portion, to 2065 m near Dry Lake Lookout on 
an unnamed summit. A well developed network of logging roads and trails is present, 
providing relatively easy access to nearly all of the area by road or foot.

A wide variety of distinctive vegetation types is present in the area. Life-form 
classes include alpine meadow, fir park, pasture, cropland, and burned, reforested areas. 
Forest vegetation includes, from high elevation to low elevation, such types as red fir, 
white fir, Douglas fir-ponderosa pine-incense cedar, pine-oak, and oak-chaparral. Thus,
the topographic and vegetational characteristics of the area are well differentiated. 

After a review of available Landsat frames which included the Doggett Creek area,
two were selected for analysis: July 4, 1973, and October 15, 1974. The two frames were 
obtained as computer compatible tapes from the EROS Data Center, Sioux Falls, S.D. 
and then reformatted and precision rectified to sinusoidal projections. Pixel size was con- 
verted to 80 x 80 meters in the rectification process to facilitate film writer playback. 

Using the July image as a base, the October frame was registered to within a half-pixel
error using seven control points. In this process, the October image was resampled to
conform with the July image using a cubic spline convolution algorithm. Figure 1
presents an image of the study area using Landsat band 5 (0.6-0.7 μm) from the July frame.

Also registered to the July image was a terrain image derived from the U.S. Geologi- 
cal Survey 1:250,000 digital terrain tape for the Weed, California, quadrangle. In the 
registration process, the image was converted to 80 x 80 meter pixel size, and stretched to 
yield a full range of gray tone values. Slope and aspect images were generated directly 
from the registered elevation data using a least squares algorithm which fits a plane to 
each pixel and its four nearest neighbors. 

The slope aspect image consisted initially of gray tone densities between 0 (black) 
and 255 (white) which indicated the azimuth of slope orientation, ranging clockwise from 
0° to 359°. These values were then transformed by a cosine function proposed by Hartung
and Lloyd (1969). Since northeast slopes present the most favorable growing
environment, and southwest slopes the least favorable, with northwest and southeast
slopes of neutral character, the density tones of azimuths were rescaled with 3 representing
due southwest and 255 representing due northeast. (Values of 0, 1, and 2 were
reserved for special codings.) Neutral slopes, oriented northwest or southeast, thus
received density tones near 127. The function also corrected automatically for the 12°
skew of the Landsat image. For processing as collateral data channels, both elevation and 
transformed aspect were converted to three-state variables: elevation to low, middle, and 
high; and aspect to southwest, neutral (southeast or northwest) and northeast. Figure 2 
shows elevation and aspect images as well as their three-state versions. 
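The rescaling and the three-state coding can be illustrated with a short sketch. The exact form of the cosine function is not given here, so the scaling below (midpoint 129, amplitude 126) is an assumption that merely reproduces the stated endpoints of 3 at due southwest and 255 at due northeast; it places neutral slopes at 129 and omits the 12° skew correction. Class boundaries follow Table 10, taking the neutral class to begin where the northeast class ends. All function names are hypothetical.

```python
import math


def aspect_tone(azimuth_deg):
    """Gray tone for a slope azimuth: 255 at due northeast (45), 3 at due southwest (225)."""
    return round(129 + 126 * math.cos(math.radians(azimuth_deg - 45.0)))


def aspect_class(azimuth_deg):
    """Three-state aspect class using the azimuth limits of Table 10."""
    a = azimuth_deg % 360.0
    if a > 337.5 or a <= 112.5:
        return "northeast"
    if 157.5 < a <= 292.5:
        return "southwest"
    return "neutral"


def elevation_class(meters):
    """Three-state elevation class using the limits of Table 10."""
    if meters <= 1067:
        return "low"
    if meters <= 1524:
        return "middle"
    return "high"


for az in (45, 135, 225, 315):
    print(az, aspect_tone(az), aspect_class(az))
```

Reserving tones 0 through 2 for special codings is respected by this scaling, since the smallest tone it produces is 3.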

Following an initial reconnaissance of the area, thirteen species-specific forest cover
classes were selected as representing the range of cover types within the study area.

These classes were defined by a set of 93 training sites ranging in size from approximately
twenty to one hundred pixels. Further processing revealed the presence of several subtypes
within most of the forest cover classes. For example, open canopy Douglas fir training
sites were divided into two subtypes. Such subtypes were also defined for hardwood,
white fir, Douglas fir, sparse, and grass and shrub cover classes. Throughout the
classification procedure, these subtypes were kept separate, joining together only in the
final classification map.

In order to obtain estimates of prior probabilities for the forest cover class types, 
one hundred points were randomly selected from a grid covering the Doggett Creek study
area by drawing coordinates from a random number table. At each of these points, the
forest cover class was determined either by interpretation of 1:8,000 color aerial photography,
or by actual field visit. Of the 100 points, 15 were discarded because they fell (1) in
locations which were inaccessible in the time available; or (2) outside the area covered by
1:8,000 air photos (and therefore could not be accurately located on either the Landsat 
frame or on the ground). At each point, the elevation and aspect class was also recorded, 
thus allowing type counts to be cross tabulated according to elevation and aspect. 

From this sample of 85 points, three sets of probabilities were prepared. The first of 
these recorded the unconditional prior probabilities of the forest cover types — that is, 
their proportional representation within the entire study area. The second and third sets of 
probabilities aggregated the points according to elevation and aspect classes, and were used 
to estimate three sets of probabilities for each topographic parameter (low, middle, and 
high for elevation, and northeast, neutral and southwest for aspect). Table 10 shows how 
the classes were defined, and describes the number of points falling into each. 

These estimates of probabilities lack precision because the number of sample points 
is small; with 85 samples distributed across 13 cover types, the calculated probabilities are 
more likely to indicate adequately the rank order of the magnitudes rather than the true 
values of the magnitudes themselves. However, under constraints of field time and 
expense, it was not possible to prepare a larger dataset for this particular trial. Considering
the sensitivity of the classification to extreme probability values, future work should
estimate these probability sets to a higher degree of accuracy.

This dataset was also used to estimate classification accuracies. By recording the
pixel location of each of the sample points, the cover type as determined on the ground
could be compared with the cover type as classified on the Landsat image. Because of
uncertainties in locating each pixel on the 1:8,000 air photos, it was necessary to specify
alternate acceptable classifications for each point. For example, a pixel falling into a stand
identified on the ground as Douglas fir, open canopy might fall almost entirely on a clearing,
and thus be classified as grass/shrub, or sparse if containing a few canopy trees. In
such a case, the classification was termed correct. On the other hand, classifications such
as hardwood, alpine meadow, or red fir forest would be an obvious error in a Douglas fir
stand. Here again, estimated accuracies are influenced by the limited size of the sample.
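The scoring with alternate acceptable classifications can be sketched as follows. The sample points are hypothetical and serve only to show the bookkeeping; the function name `accuracy` is an assumption.

```python
def accuracy(points):
    """Fraction of points whose classified type is among its acceptable types.

    points: iterable of (classified_type, acceptable_types) pairs, one per sample point.
    """
    points = list(points)
    correct = sum(1 for classified, acceptable in points if classified in acceptable)
    return correct / len(points)


# Hypothetical sample points: classifier output and the set of types accepted on the ground.
SAMPLE = [
    ("grass/shrub", {"douglas fir, open canopy", "grass/shrub", "sparse"}),
    ("red fir", {"douglas fir, open canopy", "grass/shrub", "sparse"}),
    ("white fir", {"white fir"}),
    ("hardwood", {"hardwood", "pine-oak"}),
]

print(accuracy(SAMPLE))
```

Here the first point counts as correct because grass/shrub is an accepted alternate for an open canopy Douglas fir stand, while the second counts as an error.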

Note that the field data are used to produce a classification which is then assessed 
for accuracy by reference to the same data. Separate samples would clearly be more desir- 
able. The decision to use the same set of samples to determine accuracy that was used to 
estimate the probability sets was, again, influenced by available field time and travel funds.
However, the accuracies are greatly dependent on the spatial location of the data points; 
only in the aggregate does each data point influence the classification. Thus, we would 
expect the accuracies to reflect only a slight positive bias produced by this cost-reducing 
strategy. 

Although several different classifications were carried out, only three are of importance
here. The first used spectral data only, and assumed equal prior probabilities; this
classification yielded an accuracy of 58 percent (Figure 3). In the second, three sets of
prior probabilities for the forest types were used, each contingent on one of the three 
elevation states (Table 10; Fig. 4). The classification software was modified to use a table 


look-up of prior probabilities with elevation class as one index into the prior probability 
table. This technique increased classification accuracy from 58 percent to 71 percent. 
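The table look-up can be illustrated with a small sketch of the decision rule: the class chosen is the one maximizing the log of the probability density plus the log of the prior selected by the pixel's elevation state. The three classes, the prior values, and the log densities below are hypothetical, and `classify` is not the routine used in the study.

```python
import math

# Hypothetical priors P{class | elevation state} for three illustrative cover types.
PRIORS = {
    "low":    {"pine-oak": 0.70, "white fir": 0.25, "red fir": 0.05},
    "middle": {"pine-oak": 0.30, "white fir": 0.50, "red fir": 0.20},
    "high":   {"pine-oak": 0.05, "white fir": 0.25, "red fir": 0.70},
}


def classify(log_density, elevation_state, priors=PRIORS):
    """Choose the class maximizing ln(density) + ln(prior) for the given state."""
    table = priors[elevation_state]
    scores = {
        name: log_density[name] + math.log(table[name])
        for name in log_density
        if table[name] > 0.0  # a zero prior rules the class out entirely
    }
    return max(scores, key=scores.get)


# Hypothetical log densities for one pixel; the spectral evidence alone is nearly equivocal.
pixel = {"pine-oak": -10.0, "white fir": -9.5, "red fir": -9.8}
for state in ("low", "middle", "high"):
    print(state, classify(pixel, state))
```

With nearly equal densities, the same pixel is assigned to pine-oak at low elevation, white fir at middle elevation, and red fir at high elevation, which is the behavior the elevation-contingent priors are meant to produce.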

The third classification used two sets of prior probabilities contingent on elevation
and aspect, analogous to P{ω_k|ν_j} and P{ω_k|o_l} in the preceding section. Software then
systematically sampled the registered elevation class and terrain class images to yield the
joint probabilities of elevation and aspect classes, analogous to P{ν_j,o_l}. The iterative
algorithm described earlier was then applied to estimate the set of conditional probabilities
for forest cover classes contingent on all combinations of elevation and aspect classes.
Classification using these estimated probabilities contingent on both elevation and aspect
yielded an accuracy of 77 percent, an improvement over that observed for elevation alone
(Figure 5). Thus, this example demonstrates how prior probabilities can be used to 
merge continuous variables of multispectral brightness with discrete variables of elevation 
and aspect class to improve classification accuracies. 

Discussion 

As noted earlier, extreme values of prior probabilities can force a classification to be 
made essentially without information concerning the observation itself. When priors are 
equal, however, they have no effect. The classifier, then, continuously trades off the role 
of the multivariate information for the role of prior information, depending on both the 
magnitude of the distance of the multivariate observation vector from the class mean vec- 
tor and the ratios of the particular prior probabilities involved in the decision. When the 
experimental design allows the priors to be determined by external conditioning variables, 
the effect is to classify based on multivariate information when the possible classes are not
particularly influenced by the conditioning variable and to classify based on prior information
when multivariate data are equivocal or some classes are much more or less likely
than others. Thus, in the forest classification example, terrain information served to
differentiate species-specific cover types (e.g., red fir, white fir), whereas spectral information
differentiated life form classes (e.g., grass-shrub, hardwood).

Another important point concerns the dependence of the prior probabilities on the 
scene itself; the relative areas of the classes in the output scene must be accurately 
estimated. If the output scene shifts in area, then the priors must be changed. The 
classification cannot be extended to a new area without reestimating the prior probabilities. 
Thus, it would be appropriate to use county crop acreage values to set prior probabilities 
only when the entire area of the county, no more, no less, is to be classified. In the case
of one or two external conditioning variables determining the appropriate set of priors,
both the joint probabilities P{ν_j,o_l} and the conditional probabilities P{ω_k|ν_j} and
P{ω_k|o_l} must truly represent the area to be classified, for, taken together, they determine
the prior probability values actually used in computation. In some applications, it may be
possible to extend the conditionals to a new scene in which only the joint probabilities of
the variables change (for example, a forest cover classification with elevation and aspect
as conditioning variables which is extended from one uniform area to another). The new
area will have different joint probabilities P{ν_j,o_l} (and derived priors P{ν_j} and P{o_l}),
but it might be reasonable to assume that the conditional probabilities are ecologically 
based and remain consistent from one area to the next. 

It should be noted that collateral information cannot be incorporated through the 
prior probability mechanism without the collection of data to estimate the priors and/or 
conditionals. If the collateral data are likely to be unrelated to the multivariate data and 
are expected to influence strongly the prior probabilities of the classes, then such estimating
costs will be justified, for significant improvements in accuracy should result. Ultimately,
it is up to the user to balance the costs of acquiring such information against the
expected payoff in accuracy.






The logic which culminates in decision rules R_2 and R_3 assumes that the mean vector
and dispersion matrix for a class are not affected by the external conditioning variable (see
expression (35)); in the remote sensing case, this means that the signatures are invariant.
In some applications, this assumption may not be warranted. An example would be
an agricultural crop classification with soil type as a collateral variable. Here the signature
of the soil itself, at least in the earlier stages of crop development, will influence the crop
signature. In this situation, there is no recourse but to spectrally characterize each combination
of crop and soil type so that probabilities of the form P{ω_k|X_i,ν_j} can be calculated.
Following the logic of expressions (5) through (11), it is possible to show that

    P\{\omega_k \mid X_i,\nu_j\} = \frac{P\{X_i \mid \omega_k,\nu_j\} \cdot P\{\omega_k \mid \nu_j\}}{\sum_{k=1}^{K} P\{X_i \mid \omega_k,\nu_j\} \cdot P\{\omega_k \mid \nu_j\}}    (57)

which could be made the basis of a decision rule related to R_2.
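The posterior calculation can be sketched as a normalization over classes for one fixed state of the conditioning variable. The function name `posterior` is an assumption. For illustration, the density and prior values are those of the two-class numerical example of Tables 2 and 3, where the densities happen to be the same for both states; with state-dependent signatures a different density list would simply be supplied for each state.

```python
def posterior(density, prior):
    """Posterior class probabilities for one fixed state of the conditioning variable.

    density[k] is the density of the observation for class k under that state;
    prior[k] is the prior probability of class k given that state.
    """
    joint = [d * p for d, p in zip(density, prior)]
    total = sum(joint)
    return [value / total for value in joint]


DENSITY = [0.0532, 0.0338]                    # density values for classes 1 and 2
print(posterior(DENSITY, [0.5, 0.5]))         # state 1: equal priors
print(posterior(DENSITY, [0.3, 0.7]))         # state 2: unequal priors
```

With equal priors the first class is favored (about .611 against .389); with priors of .3 and .7 the decision reverses (about .403 against .597).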

A final point worthy of discussion concerns modeling of joint probabilities, suggested 
in an earlier section to reduce the need for more extensive ground sampling. The model 
presented is but one example of a large variety of techniques by which collateral data can 
be used to predict the spatial distribution of classes in an output image. Discrete and con- 
tinuous collateral variables can be merged either using empirical techniques including mul- 
tiple regression, logit analysis, discriminant analysis, analysis of covariance, and con- 
tingency table analysis, or by constructing more functional models which model the spatial 
processes actually occurring in a deterministic way. When such models are constructed 
and interfaced with remotely sensed data, the result may be extremely powerful, both for 
the ability to accurately predict a spatial pattern and for the understanding of the complex 
system which produces it. 






Conclusions 

The use of prior probabilities in maximum likelihood classification allows: 

(1) the incorporation into the classification of prior knowledge 
concerning the frequencies of output classes which are expected 
in the area to be classified; 

(2) the merging of one or more discrete collateral datasets into 
the classification process through the use of multiple prior pro- 
bability sets describing the expected class distribution for each 
combination of discrete collateral variables; and 

(3) the use of time-sequential information in making the out- 
come of a later classification contingent on an earlier 
classification. 

Thus, prior probabilities can be a powerful and effective aid to improving classification 
accuracy and modeling the behavior of spatial systems. 

Acknowledgements 

The research reported herein was supported in part by NASA grant NSG-2377, NASA 
contract NAS-9-15509, and the California Institute of Technology’s President’s Fund (award
PF-123), under NASA contract NAS-7-100. I am greatly indebted to D. S. Simonett, P. H. 
Swain, R. M. Haralick, and W. Wigton for critical review of the manuscript. 

References 

Addington, J. D. (1975), VICAR Program FASTCLAS, VICAR Documentation, Image
Processing Laboratory, Jet Propulsion Laboratory, California Institute of Technology,
Pasadena, CA, 3 pp.

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975), Discrete Multivariate
Analysis: Theory and Practice, MIT Press, Cambridge, MA, 557 pp.

Brooner, W. G., Haralick, R. M., and Dinstein, I. (1971), Spectral parameters affecting
automated image interpretation using Bayesian probability techniques, Proc. Seventh
Int. Symp. on Remote Sens. of Environ., pp. 1929-1948.

Chow, C. K. (1957), An optimum character recognition system using decision functions,
IRE Trans. Electron. Computers 6: 247-254.

Cox, D. R. (1970), The Analysis of Binary Data, Methuen & Co., Ltd., London, 142 pp.

Deming, W. E., and Stephan, F. F. (1940), On a least squares adjustment of a sampled
frequency table when the expected marginal totals are known, Ann. Math. Statist.
11: 427-444.

Dubes, R. and Jain, A. K. (1976), Clustering techniques: The user’s dilemma, Pattern
Recognition 8: 247-260.

Fienberg, S. E. (1977), The Analysis of Cross-Classified Categorical Data, MIT Press,
Cambridge, MA, 151 pp.

Goodenough, D., and Shlien, S. (1974), Results of cover-type classification by maximum
likelihood and parallelepiped methods, Proc. Second Canadian Symp. on Remote Sensing,
1: 136-164.

Hartung, R. E., and Lloyd, W. J. (1969), Influence of aspect on forests of the Clarksville
soils in Dent County, Missouri, J. Forestry 67: 178-182.

Kendall, M. G. (1972), in Frontiers of Pattern Recognition (Satosi Watanabe, Ed.),
Academic Press, New York, pp. 291-307.

Nilsson, N. J. (1965), Learning Machines — Foundations of Trainable Pattern-Classifying 
Systems, McGraw-Hill Book Co., New York. 


Reeves, R. G., Anson, A., and Landen, D. (1975), Manual of Remote Sensing, Amer. Soc.
of Photogrammetry, Falls Church, VA, 2 vols., 2144 pp.

Richardson, A. J. and Wiegand, C. L. (1977), Distinguishing vegetation from soil background
information, Photogrammetric Engr. and Remote Sensing 43: 1541-1552.

Rouse, J. W., Jr., Haas, R. H., Schell, J. A., and Deering, D. W. (1973), Monitoring
vegetation systems in the Great Plains with ERTS, Third ERTS Symposium, NASA
Special Publication SP-351, 1: 309-317.

Sebestyen, G. (1962), Decision-Making Processes in Pattern Recognition, Macmillan, New 
York. 

Schell, J. A. (1973), in Remote Sensing of Earth Resources, Volume I (F. Shahrokhi, Ed.), 
University of Tennessee Space Institute, Tullahoma, TN, pp. 374-394. 

Shlien, S., and Smith, A. (1975), A rapid method to generate spectral theme
classification of Landsat imagery, Remote Sens. Environ. 4: 67-77.

Simonett, D. S., Eagleman, J. E., Erhart, A. B., Rhodes, D. C., and Schwarz, D. E.
(1967), The Potential of Radar as a Remote Sensor in Agriculture: 1. A Study with
K-Band Imagery in Western Kansas, CRES Report No. 61-21, University of Kansas,
Lawrence, KS, 13 pp.

Strahler, A. H., Logan, T. L., and Bryant, N. A. (1978), Improving forest cover
classification accuracy from Landsat by incorporating topographic information, Proc.
Twelfth Int. Symp. on Remote Sens. of Environ., pp. 927-942.

Swain, P. H. (1978), Bayesian classification in a time-varying environment, IEEE Trans. on
Systems, Man and Cybern., SMC-8: 879-883.

Tatsuoka, M. M. (1971), Multivariate Analysis: Techniques for Educational and Psychological
Research, John Wiley & Sons, New York, 310 pp.

Tatsuoka, M. M. and Tiedeman, D. V. (1954), Discriminant Analysis, Rev. of Ed. Res.,
25: 402-420.

Upton, G. J. G. (1978), The Analysis of Cross-Tabulated Data, John Wiley & Sons, New 
York, 148 pp. 




Tables for Prior Probabilities 


TABLE 1 Notation

TERM            DEFINITION

p               Number of measurement variables used to characterize each
                object or observation.

X               A p-dimensional random vector.

X_i             Vector of measurements on p variables associated with the i th
                object or observation; i = 1,2,...,N.

P{X_i}          Probability that a p-dimensional random vector X will take on
                observed values X_i.

ω_k             Member of the k th set of classes ω; k = 1,2,...,K.

ν_j             Member of the j th set of states for a conditioning variable ν;
                j = 1,2,...,J.

P{ω_k}          Probability that an observation will be a member of class ω_k;
                prior probability for class ω_k.

P{ν_j}          Probability that an observation will be associated with state j of
                conditioning variable ν; prior probability for state ν_j.

P{ω_k|X_i}      Probability that an observation is a member of class ω_k given
                that measurement vector X_i is observed.

Φ_k(X_i)        Probability density value associated with observation vector X_i as
                evaluated for class k.

μ_k             Parametric mean vector associated with the k th class.

[illegible]     Mean vector associated with a sample of observations belonging
                to the k th class; taken as an estimator of μ_k.

Σ_k             Parametric p by p dispersion (variance-covariance) matrix
                associated with the k th class.

[illegible]     p by p dispersion matrix associated with a sample of observations
                belonging to the k th class; taken as an estimator of Σ_k.




TABLE 2 Simple Prior Probabilities for Numerical Example

                   Conditioning Variable
Probability          ν_1        ν_2

P{ω_1}               .5         .3
P{ω_2}               .5         .7


TABLE 3 Calculation of Maximum Likelihood Posterior Probabilities

                                     ν_1                 ν_2
                                ω_1       ω_2       ω_1       ω_2

Φ_k(X_i)                       .0532     .0338     .0532     .0338
Φ_k(X_i)·P{ω_k|ν_j}            .0266     .0169     .0160     .0237
Σ_k Φ_k(X_i)·P{ω_k|ν_j}             .0435               .0397
P{ω_k|X_i,ν_j}                 .611      .389      .403      .597


TABLE 4 Conditional Probabilities for Numerical Example

              P{ω_k|ν_j}                P{ω_k|o_l}
          ν_1    ν_2    ν_3         o_1    o_2    o_3

ω_1       .6     .5     .2          .5     .8     .2
ω_2       .3     .2     .4          .4     .1     .3
ω_3       .1     .3     .4          .1     .1     .5


TABLE 5 Joint Probabilities for Numerical Example

                    P{ν_j,o_l}
            o_1      o_2      o_3      P{ν_j}

ν_1         .08      .12      .14      .34
ν_2         .07      .09      .12      .28
ν_3         .16      .10      .12      .38
P{o_l}      .31      .31      .38





TABLE 7 Calculation of Joint Two-Way Probabilities for Numerical Example

                P{ω_k|o_l}  ×  P{o_l}  =  P{ω_k,o_l}

o_1    ω_1         .5       ×   .31    =    .155
       ω_2         .4       ×   .31    =    .124
       ω_3         .1       ×   .31    =    .031

o_2    ω_1         .8       ×   .31    =    .248
       ω_2         .1       ×   .31    =    .031
       ω_3         .1       ×   .31    =    .031

o_3    ω_1         .2       ×   .38    =    .076
       ω_2         .3       ×   .38    =    .114
       ω_3         .5       ×   .38    =    .190


TABLE 8 Iterative Fitting of No-Three-Way Interaction Model

           Initial      ITERATION 1            ITERATION 2        Final
 k  j  l     P_0     P_1    P_2    P_3      P_1    P_2    P_3      P_3    P{ω_k|ν_j,o_l}

 1  1  1    .0370   .0680  .0753  .0502    .0487  .0636  .0557    .0614      .76??
 1  1  2    .0370   .0680  .1205  .1074    .1043  .1196  .1098    .1132      .943?
 1  1  3    .0370   .0680  .0369  .0525    .0510  .0457  .0494    .0496      .35??
 1  2  1    .0370   .0467  .0517  .0432    .0408  .0532  .0479    .0517      .????
 1  2  2    .0370   .0467  .0826  .0760    .0718  .0823  .0785    .0820      .91??
 1  2  3    .0370   .0467  .0253  .0289    .0274  .0245  .0250    .0221      .1845
 1  3  1    .0370   .0253  .0280  .0422    .0293  .0382  .0434    .0386      .2413
 1  3  2    .0370   .0253  .0449  .0579    .0402  .0461  .0542    .0504      .50??
 1  3  3    .0370   .0253  .0138  .0093    .0065  .0058  .0052    .0023      .01??
 2  1  1    .0370   .0340  .0408  .0272    .0309  .0262  .0230    .0179      .22??
 2  1  2    .0370   .0340  .0102  .0091    .0103  .0088  .0081    .0059      .04??
 2  1  3    .0370   .0340  .0375  .0534    .0607  .0544  .0589    .0613      .4376
 2  2  1    .0370   .0187  .0224  .0187    .0221  .0187  .0169    .0153      .????
 2  2  2    .0370   .0187  .0056  .0051    .0061  .0052  .0049    .0043      .048?
 2  2  3    .0370   .0187  .0206  .0235    .0278  .0249  .0254    .0277      .231?
 2  3  1    .0370   .0507  .0608  .0915    .0933  .0790  .0897    .0931      .581?
 2  3  2    .0370   .0507  .0152  .0196    .0200  .0170  .0200    .0217      .2169
 2  3  3    .0370   .0507  .0559  .0380    .0387  .0347  .0313    .0239      .1993
 3  1  1    .0370   .0113  .0039  .0026    .0022  .0016  .0014    .0007      .0083
 3  1  2    .0370   .0113  .0039  .0035    .0029  .0023  .0021    .0009      .0077
 3  1  3    .0370   .0113  .0239  .0341    .0288  .0293  .0317    .029?      .207?
 3  2  1    .0370   .0280  .0096  .0081    .0080  .0058  .0052    .0030      .0430
 3  2  2    .0370   .0280  .0096  .0089    .0088  .0068  .0065    .0036      .040?
 3  2  3    .0370   .0280  .0591  .0675    .0672  .0683  .0696    .0702      .5848
 3  3  1    .0370   .0507  .0175  .0263    .0329  .0236  .0268    .0283      .1770
 3  3  2    .0370   .0507  .0175  .0225    .0282  .0219  .0257    .0279      .279?
 3  3  3    .0370   .0507  .1070  .0727    .0910  .0924  .0834    .0937      .7811

? marks a digit that is illegible in the reproduction.


TABLE 9 Agricultural Time-Sequential Classification Example

                                      P{ω_k|ν_j}
                              ω_1      ω_2       ω_3       ω_4
Spring Class      P{ν_j}      Rice     Cotton    Orchard   Fallow

ν_1 Rice            .3         .9       .0        .0        .1
ν_2 Cotton          .0         --       --        --        --
ν_3 Orchard         .2         .0       .0       1.0        .0
ν_4 Fallow          .5         .1       .7        .1        .1

P{ω_k}                        .32      .35       .25       .08



TABLE 10 Elevation and Aspect Class Definitions

Code           Definition                         Point Count

Elevation
  Low          < 1067 m                               45
  Middle       1068-1524 m                            26
  High         > 1525 m                               14

Aspect
  Northeast    337.6°-112.5°                          26
  Neutral      112.6°-157.5°; 292.6°-337.5°           25
  Southwest    157.6°-292.5°                          34






Figure Captions for Prior Probabilities

FIGURE 1. Landsat Band 5 image of Doggett Creek study area, Klamath National
Forest, California.

FIGURE 2. Registered continuous and tri-level elevation (upper photos) and aspect
(lower photos) images of Doggett Creek study area.

FIGURE 3. Classification map based on spectral data only; accuracy, 58 percent.

FIGURE 4. Classification map based on spectral data, with elevation included by varying
prior probabilities. Key to map symbols is included in Figure 3. Accuracy is 71 percent.

FIGURE 5. Classification map using spectral, elevation, and aspect data. Key to map
symbols is included in Figure 3. Accuracy is 77 percent.







Figure 1 





Figure 2. (upper left) 


[Panel title: DOGGETT CREEK, CALIFORNIA, TRI-LEVEL ELEVATION IMAGE]

Figure 2. (upper right) 

















[Panel title: DOGGETT CREEK, CALIFORNIA, COMPASS ASPECT, LOW PASS FILTER OF FUNCTION IMAGE]


Figure 2. (lower left) 
















Figure 2. (lower right)



[Map title: FOREST TYPE CLASSIFICATION MAP FROM MULTI-DATE LANDSAT AND
DIGITAL TERRAIN DATA, OCTOBER 1974. NO-TERRAIN APPROACH. DOGGETT CREEK
VICINITY, KLAMATH NATIONAL FOREST. Legible legend entries: ponderosa pine, open
canopy; ponderosa pine, closed canopy; Douglas fir, open canopy; Douglas fir, closed
canopy; white fir, open canopy; red fir, open canopy; red fir, closed canopy; hardwood;
unclassified. North: 343 degrees from top. Scale in kilometers. University of California,
Geography Remote Sensing Unit, Santa Barbara, and Jet Propulsion Laboratory, Pasadena.]


Figure 3 





Figure 4. 


















